Quo Vadis, Motion Generation? From Large Language Models to Large Motion Models

Ye Wang1,*, Sipeng Zheng2,*, Bin Cao2,3, Qianshan Wei4, Qin Jin1, Zongqing Lu5
1Renmin University
2Beijing Academy of Artificial Intelligence
3Institute of Automation, Chinese Academy of Sciences
4Southeast University
5School of Computer Science, Peking University
*Equal contribution. Ye Wang works as an intern at BAAI. Correspondence to Zongqing Lu <[email protected]>.
Abstract

Inspired by the recent success of LLMs, the field of human motion understanding has increasingly shifted towards the development of large motion models. Despite some progress, current state-of-the-art works remain far from achieving truly generalist models, largely due to the lack of large-scale, high-quality motion data. To address this, we present MotionBase, the first million-level motion generation benchmark, offering 15 times the data volume of the previous largest dataset, and featuring multimodal data with hierarchically detailed text descriptions. By leveraging this vast dataset, our large motion model demonstrates strong performance across a broad range of motions, including unseen ones. Through systematic investigation, we underscore the importance of scaling both data and model size, with synthetic data and pseudo labels playing a crucial role in mitigating data acquisition costs. Moreover, our research reveals the limitations of existing evaluation metrics, particularly in handling out-of-domain text instructions — an issue that has long been overlooked. In addition to these, we introduce a novel 2D lookup-free approach for motion tokenization, which preserves motion information and expands codebook capacity, further enhancing the representative ability of large motion models. The release of MotionBase and the insights gained from this study are expected to pave the way for the development of more powerful and versatile motion generation models.

1 Introduction

Motion generation is an emerging field with diverse applications in video games, filmmaking, and robotics animation. At the forefront of this area is text-to-motion generation (T2M) (Ahn et al., 2018; Ahuja & Morency, 2019), which plays a crucial role in translating natural language into human motions. State-of-the-art T2M models typically combine a motion quantization method (e.g., VQ (Van Den Oord et al., 2017)) with a text encoder (e.g., CLIP (Radford et al., 2021)) and a decoder (e.g., GPT-2 (Radford et al., 2019)) to generate motion sequences from detailed textual instructions. Despite the availability of a few high-quality datasets (Guo et al., 2022a; Lin et al., 2024) curated in recent years, their limited size restricts current methods to a narrow range of scenarios, creating performance bottlenecks when addressing diverse or unseen motions, as illustrated in Figure 1 (RIGHT).

The rapid advancement of large language models (LLMs) (Touvron et al., 2023a) in multimodal learning has been significantly bolstered by the availability of vast data resources (Zheng et al., 2024; Xu et al., 2024). In contrast, the volume of motion data remains considerably smaller than that of visual-text data, as illustrated in Figure 1 (LEFT). This disparity primarily arises from the high costs associated with motion data collection, which often requires specialized wearable devices and substantial human labor for annotation. Consequently, developing a state-of-the-art (SoTA) large motion model based on LLMs presents a significant challenge and remains an unresolved issue. While some recent efforts (Jiang et al., 2023) have explored this direction, the effectiveness of large motion models has yet to be fully demonstrated.

In this paper, we aim to address the question: “Can a large motion model be a promising direction for motion generation?” To tackle this, we have developed a systematic data collection scheme that led to the creation of MotionBase, the first large-scale dataset containing over one million motion sequences — 15 times larger than the previous largest dataset. This initiative provides a solid foundation for building robust, universally applicable large motion models and offers a comprehensive testbed for future research.

Figure 1: LEFT: Curves showing the effects of scaling up large motion models. MotionBase is the first large text-to-motion dataset comparable in scale to visual benchmarks like ImageNet. RIGHT: While existing models perform well on constrained datasets like Motion-X and HumanML3D, they struggle with out-of-domain concepts on MotionBase, exhibiting limited generalization.

Building on the solid foundation of MotionBase, we can now conduct a comprehensive investigation into the effectiveness of large motion models. This research aims first to identify key factors driving their advancement and to offer valuable insights for future model design, including: ❶ Scaling both data and model size significantly reduces joint prediction errors on critical metrics while improving generalization to novel motions. ❷ Despite observable domain gaps, synthetic and static data, as well as pseudo motion labels, are becoming increasingly essential and effective, especially given the high cost of acquiring ground truth motion data. ❸ Existing metrics show limitations when faced with out-of-domain text instructions. Notably, the widely used metric, FID, fails to accurately capture the alignment between ground truth and generated motions. Our findings highlight the need for a more robust and equitable evaluation framework that enhances open-set generalization.

In addition to these factors, we argue that large motion models are further constrained by inadequate motion representation. Most approaches rely on transforming motion into discrete tokens via vector quantization (VQ), which are then processed by autoregressive models to generate motion sequences. While these methods have produced impressive results, they suffer from two major drawbacks. ❶ Information loss: The current VQ process inevitably leads to the loss of critical information. Given a motion clip with $D$-dimensional features $\mathcal{M}=\{m_{1},m_{2},...,m_{T}\}$, where $m_{i}\in\mathbbm{R}^{D}$, VQ compresses it into a list of 1D embeddings of size $\lfloor T/\alpha\rfloor\times d$, where $\alpha$ is the temporal downsampling ratio and $d$ is the codebook dimension. Unlike images, which consist of uniform RGB pixel values, each motion state $m_{i}$ contains a set of distinct features (e.g., joint position, velocity, foot-ground contact). Using a single 1D embedding to represent such complex motion states is insufficient. This not only results in the loss of vital information but also limits the model’s ability to flexibly generate motion at a part-level. ❷ Limited codebook size: Existing VQ methods are limited by a small codebook, meaning that all possible human motions must be selected from these limited options. Consequently, these 1D embeddings fail to capture the vast diversity of human motion.

To address this issue, we propose treating a motion clip as a 2D image with a single channel, represented as $\mathcal{M}\in\mathbbm{R}^{T\times D\times 1}$. By expanding the dimensionality of the motion clip from 1D to 2D, we enhance the encoder’s capacity, improving its ability to represent complex motions while retaining more critical information after tokenization. Although increasing the size of the codebook is a straightforward way to enhance its expressiveness, this approach often leads to “codebook collapse”, particularly when training samples are scarce. To mitigate this, we introduce a finite scalar quantizing method inspired by Mentzer et al. (2023), which enables learning a large motion vocabulary without requiring a lookup for corresponding tokens in the codebook for each entry. As a result, we expand the motion codebook by at least two orders of magnitude, boosting its representational capacity while maintaining efficiency.

We summarize our main contributions as follows. (1) MotionBase: We introduce MotionBase, the first large-scale motion generation benchmark containing over one million motions with detailed textual descriptions, significantly advancing the capability to effectively train motion generation models. (2) Key Insights: Our research identifies critical factors affecting the effectiveness of large motion models, emphasizing the importance of scaling both data and model size. Additionally, we uncover limitations in the current evaluation metrics, particularly when handling diverse and unseen motions. (3) Novel Motion Quantization: We propose a novel motion quantization approach that represents motion clips as 2D images and constructs a finite-scale codebook without requiring token lookups. This method retains essential information and expands the capacity of the motion encoder, enhancing the ability of large motion models to leverage large-scale motion data.

2 Related Work

2.1 Large Language Models and Multi-Modality

Substantial advancements have been made in enhancing LLMs (Brown et al., 2020; Raffel et al., 2020; Chowdhery et al., 2022) with the ability to understand and respond to human instructions, through a technique known as instruction tuning (Ouyang et al., 2022). Recent research has extended these capabilities to the multimodal domain (Ye et al., 2023; Zheng et al., 2023), with notable work by Liu et al. (2023), who pioneered visual instruction tuning to create a highly adaptable visual assistant. Additionally, Li et al. (2023a) integrated multimodal context directly into instruction data to further enhance model performance. Subsequent studies (Zhang et al., 2023b; Zhao et al., 2023) expanded this research by scaling up instructional datasets and incorporating image-rich text. Notably, Dai et al. (2023) developed InstructBLIP, based on BLIP-2 (Li et al., 2023b), which features an advanced visual feature extraction mechanism to improve performance across vision-language tasks. Despite these breakthroughs, the application of multimodal models to human motion remains less competitive compared to current state-of-the-art (SoTA) methods, although recent initiatives are beginning to explore this domain (Jiang et al., 2023; Zhang et al., 2024b).

2.2 Vector Quantization

Vector quantization (VQ) has been highly successful in generating high-quality images (Van Den Oord et al., 2017) and videos (Gupta et al., 2022; Yan et al., 2021). VQ-VAE first converts images into discrete representations and autoregressively models their distribution. Building on this, Lee et al. (2022) introduced residual quantization (RQ), which encodes images into a stacked map of discrete codes, efficiently reducing the spatial resolution of features. You et al. (2022) further developed hierarchical vector quantization (HQ), employing a pyramid scheme with two-level codes for image encoding. Most existing motion generation approaches have adopted VQ or its variants to quantize human motions. However, the small codebook size in traditional VQ methods limits their ability to generalize and accurately represent the diversity of human motions. Although increasing the codebook size can improve representational capacity, it often leads to codebook collapse. Recently, Mentzer et al. (2023) demonstrated that discrete codes can be obtained via scalar quantization, where each scalar entry is independently quantized to the nearest integer through rounding. Similarly, Yu et al. (2023) introduced a lookup-free codebook that maps videos into compact discrete tokens, utilizing all codes without auxiliary losses and expanding the codebook size.
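
To make the contrast between the two families concrete, the following minimal Python sketch compares a nearest-neighbour codebook lookup (the VQ style) with per-dimension rounding (the scalar quantization style). The function names, the toy codebook, and the tanh bounding controlled by `levels` are illustrative assumptions, not the implementation of any cited work.

```python
import math

def vq_lookup(z, codebook):
    """Return the index of the codebook entry closest to z (squared L2 distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))

def scalar_quantize(z, levels=5):
    """Bound each entry with tanh, then round it to the nearest integer.

    With an odd number of `levels`, every entry lands in
    {-(levels // 2), ..., levels // 2}; no codebook lookup is needed.
    """
    half = levels // 2
    return [round(half * math.tanh(x)) for x in z]

# Toy codebook with three 2D entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
print(vq_lookup([0.9, 0.8], codebook))      # 1: the entry [1.0, 1.0] is closest
print(scalar_quantize([0.1, 2.5, -3.0]))    # [0, 2, -2]
```

In the rounding variant the set of reachable codes is implicit (`levels` choices per dimension), which is why the vocabulary can grow without storing or searching an embedding table.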

2.3 Human Motion Generation

The task of motion generation involves creating human motion based on various inputs, such as text descriptions (Guo et al., 2022b; Petrovich et al., 2022), action labels (Cervantes et al., 2022; Guo et al., 2020) or motion prefixes (Liu et al., 2022; Mao et al., 2019). Among these, text-to-motion (T2M) generation has received the most attention due to the ease and flexibility of using natural language as input. Early approaches (Fragkiadaki et al., 2015; Ghosh et al., 2017; Gopalakrishnan et al., 2019) relied on deterministic motion modeling, which often produces averaged, blurry results. To overcome this, researchers introduced stochastic methods using models like GANs (Cai et al., 2018; Wang et al., 2020) or VAEs (Aliakbarian et al., 2020). For instance, T2M-GPT (Zhang et al., 2023a) extends the temporal VAE to capture the probabilistic relationship between text and motion. More recently, Guo et al. (2024) proposed improving traditional vector quantization (VQ) by integrating residual quantization and a masked modeling framework. To better align with a motion auto-encoder, MotionCLIP (Tevet et al., 2022) incorporates CLIP (Radford et al., 2021) as the text encoder, bringing in more robust text priors. Additionally, Zhang et al. (2024b) and Jiang et al. (2023) explored the development of unified models based on LLMs which accept multimodal conditions (e.g., vision, text, and pose), enabling the generation of subsequent, preceding, or “in-between” motions. Despite leveraging the power of LLMs, these large motion models remain limited to in-domain text instructions and do not yet perform as competitively as existing SoTA methods.

In this work, we aim to bridge the gap between large language models and generalized, reliable large motion models. To achieve this, we begin by introducing MotionBase, a novel, large-scale dataset designed to support extensive pretraining and comprehensive, fair evaluation.

Table 1: Comparison with existing human motion datasets. More details can be found in our appendix. In the table, B, H, and F refer to body, hand, and face, respectively. “part” indicates that the text captions include fine-grained descriptions of body parts, while “body” means the descriptions are not as detailed. “multi” and “single” specify whether the dataset contains multi-person scenarios or only single-person data. Our MotionBase is the largest motion generation dataset and benchmark to date, featuring at least 15× more data than previous datasets, along with additional modalities.
SEQ NUMBER MOTION TEXT RGB DEPTH BBOX PERSON
KIT (Plappert et al., 2016) 5.7K B body single
HumanML3D (Guo et al., 2022a) 29.2K B body single
Motion-X (Lin et al., 2024) 81.1K B,H,F body single
MotionBase-V1 >1M B,H part multi

3 MotionBase Dataset

Data is the foundation of large motion models. With advancements in fields like human pose detection, we are now able to extract high-quality motion sequences from vast amounts of online videos, including datasets like InternVid (Wang et al., 2023) and WebVid (Bain et al., 2021). In its initial public release, our MotionBase contains over one million motion clips, each annotated with fine-grained automatic pseudo labels. A comparison with existing benchmarks is presented in Table 1. Our data collection pipeline involves the following key steps, in order.

Source Video Collection and Cleaning: We begin by collecting over 20 million videos from publicly available datasets and online platforms such as YouTube. To ensure quality and relevance, we filter out videos that do not contain human figures.

2D-3D Keypoint Estimation: Keypoints are essential for capturing the skeletal structure of human motion. Initially, we estimate whole-body 2D keypoints with confidence scores using a pretrained model (Xu et al., 2022). To further enhance motion accuracy, we estimate precise 3D keypoints with another pretrained model (Sárándi et al., 2023) trained on large 3D datasets. Following the method of Lin et al. (2024), we apply temporal smoothing and enforce 3D bone length constraints during triangulation, improving the stability and consistency of the keypoint estimations.

Incorporating Additional Modalities: A comprehensive understanding of human motion benefits from the inclusion of diverse modalities such as RGB and depth data. To enrich MotionBase, we provide annotations for these additional modalities. Furthermore, MotionBase includes videos featuring multi-person scenarios, with each motion sequence grounded in its corresponding video through object-level bounding boxes. Although this paper primarily focuses on the text-to-motion task, these additional modalities open avenues for future research in other areas.

Local-Global Pose Estimation: We begin by registering the body model SMPL-X (Pavlakos et al., 2019) to each frame in MotionBase, using the estimated keypoints in a progressive learning-based mesh fitting method (Lin et al., 2024). Specifically, we predict SMPL-X parameters using a pretrained body mesh recovery method, OSX (Lin et al., 2023), followed by iterative optimization to fit the parameters to the target 2D and 3D joint positions. After fitting, we apply global motion optimization based on Yuan et al. (2022) to refine both global motions and camera poses simultaneously, ensuring alignment with the video evidence. Finally, for motions with noisy or occluded input data, we reconstruct complete and plausible motions using RoHM (Zhang et al., 2024a).

Hierarchical Motion Descriptions: Existing motion benchmarks face inherent limitations in their text descriptions. Previous studies (Guo et al., 2022a) typically use a single sentence to describe whole-body motions, neglecting finer details of individual body parts, such as the arms or legs. This approach restricts the ability of motion generation models to perform more nuanced body comprehension and flexible part-level motion control (e.g., raising only the left arm). Moreover, the richness of text labels often varies across different motions; for example, a large portion of the Motion-X dataset provides only action labels. In contrast, MotionBase offers hierarchical textual annotations for each video. We carefully design a prompt format and use Gemini-1.5-pro (Reid et al., 2024) to generate detailed descriptions for individual body parts (e.g., left arm, right leg), assigning a dedicated sentence to each. Additionally, we summarize the overall body movement in a paragraph containing 1–3 sentences, providing a more comprehensive description of the motion.
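
One way to picture such a hierarchical annotation is as a small record holding one sentence per body part plus a whole-body summary. The sketch below is a minimal illustration in Python; the class name, field names, and example sentences are hypothetical and do not reflect the released annotation schema.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class HierarchicalCaption:
    """Hypothetical container for the text annotation of one clip."""
    # One dedicated sentence per body part, keyed by part name.
    part_descriptions: Dict[str, str] = field(default_factory=dict)
    # Whole-body summary (the text above specifies 1-3 sentences).
    body_summary: str = ""

    def to_prompt(self) -> str:
        """Flatten the annotation into one string, summary first."""
        parts = [f"{name}: {text}" for name, text in self.part_descriptions.items()]
        return " ".join([self.body_summary, *parts]).strip()

caption = HierarchicalCaption(
    part_descriptions={
        "left arm": "The left arm rises above the head.",
        "right leg": "The right leg stays planted on the ground.",
    },
    body_summary="A person raises their left arm while standing still.",
)
print(caption.to_prompt())
```

Keeping the part-level sentences separate from the summary makes it possible to condition on either level, for example only on the "left arm" entry for part-level control.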

Figure 2: Examples from MotionBase, which encompasses a diverse range of human motions, including both long-term clips and static snapshots. It features various scenes, ranging from outdoor environments to indoor settings, and includes both clean, single-person scenarios as well as crowded, multi-person scenes. Additionally, MotionBase comprises a mix of real-world data and synthetic data generated by game engines. For more details about MotionBase, please refer to Appendix A.

4 Scaling up Large Motion Model

4.1 Overall Architecture

Similar to previous LLM-based multimodal models, we treat motion as a foreign language. The overall framework is presented in Figure 11 in Appendix B. Our large motion model, built on a pre-trained LLM, functions as a generative model that connects a motion tokenizer with the LLM backbone $\Theta$. The motion tokenizer encodes raw motion clip features $\mathcal{M}$ into token embeddings $\mathcal{V}=\{v_{1},v_{2},...,v_{n}\}\in\mathbbm{R}^{n\times d}$, where $n$ denotes the number of motion tokens and $d$ represents the dimensionality of each token. To integrate motion tokens into the LLM framework, we incorporate $K$ discrete codes in the motion codebook as additional vocabulary for the LLM. Additionally, we introduce two special tokens, <mot> and </mot>, to signify the start and end of motion sequences within the input/output streams. The LLM backbone $\Theta$ is built on a decoder-only architecture using causal transformers. The model generates outputs $\mathcal{Y}=\{y_{1},y_{2},...,y_{m}\}$ in an auto-regressive manner, where $\mathcal{Y}$ corresponds to the generated motion sequence based on the provided motion-text input tokens.
In this work, each motion-text pair in the MotionBase dataset is framed as an instruction-following instance $\{\mathcal{X}_{Q},\mathcal{X}_{M}\}$, representing a question-answer interaction between the user and the motion model. The entire instructional dataset adheres to this unified format. To train our model, we optimize the negative log-likelihood over the predicted tokens, which is defined as follows:

$$\mathcal{L}(\Theta)=-\sum_{j=1}^{L}\log P_{\Theta}(y_{j}\mid desc,\hat{y}_{1:j-1}), \tag{1}$$

where $\hat{y}$ and $y$ denote the input and target token sequences, respectively. $\Theta$ represents the model parameters, and $L$ is the length of the target sequence. The input description, $desc$, can be empty depending on the instruction provided.
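
As a rough illustration of this objective, the Python sketch below sums the negative log-probabilities of the target tokens given per-step logits. It assumes the logits at step $j$ were already produced by a model conditioned on the description and the preceding tokens; the function names and toy values are illustrative and not part of the paper's code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sequence_nll(step_logits, targets):
    """Sum of -log P(y_j | context) over the L target positions.

    step_logits[j] holds the vocabulary logits produced at step j, i.e.
    after conditioning on the description and the tokens before step j.
    """
    assert len(step_logits) == len(targets)
    return -sum(log_softmax(logits)[y] for logits, y in zip(step_logits, targets))

# Toy example: vocabulary of 4 tokens, target sequence of length L = 3.
step_logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
targets = [0, 1, 3]
print(sequence_nll(step_logits, targets))
```

With uniform logits over a vocabulary of size $V$, each step contributes $\log V$ to the loss, which is a convenient sanity check for an untrained model.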

4.2 2D Lookup-free Motion Quantization

Similar to visual tokenization, motion tokenization is a process that compresses motion signals into a series of discrete tokens, typically involving an encoder $\mathbbm{E}$, a decoder $\mathbbm{D}$ and a codebook $\mathbbm{C}$. We propose a 2D lookup-free quantization method as a key component for building large motion models.

2D Motion Quantization. Traditional motion quantizers use 1D embeddings to represent motion at each timestamp, which inevitably results in the loss of crucial information. Furthermore, this approach limits the quantizer’s ability to generate and interpret part-level motions. To address these limitations, we treat the motion sequence $\mathcal{M}=\{m_{1},m_{2},...,m_{T}\}$ as a single-channel image, representing each motion sequence as $\mathcal{M}\in\mathbbm{R}^{T\times D\times 1}$. Each motion embedding $m_{i}$ is divided into $P$ components, capturing distinct features of motion, such as root orientation, joint rotation and foot contact. Our motion encoder then converts $\mathcal{M}$ into a feature map $\mathbbm{E}(\mathcal{M})\in\mathbbm{R}^{\lfloor T/\alpha\rfloor\times P\times d}$, where $\alpha$ denotes the temporal downsampling ratio. This approach ensures that each body part is tokenized separately, allowing for more granular, part-level motion encoding and decoding.
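
The difference in token layout can be checked with a few lines of Python. The sketch below only computes the grid shapes implied by the text (time steps after downsampling, by number of components); the clip length `T = 64` and the number of components `P = 3` are hypothetical values chosen for illustration.

```python
def token_grid_shape(num_frames, downsample, num_parts=1):
    """Shape (time, parts) of the token grid produced by a tokenizer."""
    return (num_frames // downsample, num_parts)

T, alpha = 64, 4   # hypothetical clip length; alpha is the temporal downsampling ratio
P = 3              # hypothetical number of feature components per frame

one_d = token_grid_shape(T, alpha)       # 1D tokenizer: one token per time step
two_d = token_grid_shape(T, alpha, P)    # 2D tokenizer: P tokens per time step
print(one_d, two_d)                      # (16, 1) (16, 3)
```

Under these assumptions the 2D layout yields $P$ times as many tokens per clip, each covering a narrower slice of the motion features.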

Lookup-Free Quantization. Traditional motion quantizers are often constrained by small codebook sizes, restricting their ability to capture the full diversity of human motion. A common approach is to expand the motion vocabulary. However, excessively enlarging the codebook can result in “codebook collapse”, where only a small subset of tokens in the codebook is used, offering minimal performance improvements. In some cases, an overly large vocabulary can even degrade the model’s overall performance. To address this issue, a more effective alternative is to reduce the dimensionality of code embeddings (Mentzer et al., 2023), which limits the representational capacity of individual tokens and encourages more efficient learning across a larger vocabulary. Similar to Yu et al. (2023), we reduce the embedding dimension of the codebook to zero by replacing the codebook $\mathbbm{C}\in\mathbbm{R}^{K\times d}$ with an integer set $\mathbbm{C}$ with $|\mathbbm{C}|=K$. Specifically, $\mathbbm{C}$ is the Cartesian product of single-dimensional variables $\mathbbm{C}=\times_{i=1}^{d}C_{i}$, where $C_{i}=\{-1,1\}$ and $d$ is equal to $\log_{2}K$. Given a feature vector $z\in\mathbbm{R}^{d}$, our quantizer $Q(\cdot)$ converts each dimension of the quantized representation into:

$$Q(z_{i})=\operatorname*{arg\,min}_{c_{ij}}\|z_{i}-c_{ij}\|=-\mathbbm{1}\{z_{i}\leq 0\}+\mathbbm{1}\{z_{i}>0\}, \tag{2}$$

where $c_{ij}$ denotes the $j$-th value of $C_{i}$. The token index is computed as $Index(z)=\sum_{i=1}^{d}2^{i-1}\mathbbm{1}\{z_{i}>0\}$. To train the tokenizer, we employ a standard combination of reconstruction, perceptual, and commitment losses, along with an entropy penalty to promote better codebook utilization (Yu et al., 2023). Importantly, we exclude the use of GAN loss, as it was found to negatively impact training stability.
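
The quantizer and the index computation can be written out in a few lines. The following Python sketch follows the sign rule and the binary index formula stated above; the function names and the toy vector are illustrative assumptions, and the training losses are omitted.

```python
def lfq_quantize(z):
    """Map each entry to -1 if it is <= 0 and to +1 otherwise."""
    return [1 if x > 0 else -1 for x in z]

def lfq_index(z):
    """Token index: sum over i of 2^(i-1) * 1{z_i > 0}, with i starting at 1."""
    # enumerate starts at 0, so 2 ** i here equals 2^(i-1) for 1-based i.
    return sum(2 ** i for i, x in enumerate(z) if x > 0)

z = [0.3, -1.2, 0.0, 2.1]    # d = 4, so K = 2^4 = 16 possible tokens
print(lfq_quantize(z))       # [1, -1, -1, 1]
print(lfq_index(z))          # 2^0 + 2^3 = 9
```

Because the index is read directly off the signs of the $d$ entries, no distance computation against a stored table of $K$ embeddings is needed.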

5 Experiments

5.1 Experimental Setup

Datasets. Our investigation is first conducted on the following text-to-motion datasets: HumanML3D (Guo et al., 2022a) and Motion-X (Lin et al., 2024). HumanML3D comprises 14,616 motion clips sourced from the AMASS dataset (Mahmood et al., 2019), paired with 44,970 textual descriptions. Motion-X, a more recent dataset, includes approximately 81,000 motion clips. To validate our conclusions on larger-scale data, we also carry out experiments on the proposed MotionBase dataset with two variants: MotionBase-0.5 and MotionBase-1.0. MotionBase-0.5 contains 500,000 clips, while MotionBase-1.0 encompasses the full scope of our collected data, with over 1 million clips. Following standard practice, each dataset is split into training, validation, and test sets in proportions of 80%, 5%, and 15%, respectively.

Evaluation Metrics. For the motion generation task, we employ the following metrics in our experiments, following Guo et al. (2022a). (1) Fréchet Inception Distance (FID): This metric assesses overall motion quality by measuring the distributional difference between the high-level features of generated motions and real motions. (2) Motion-retrieval Precision (R-Precision) and Multimodal Distance (MMDist): These metrics evaluate the semantic alignment between the textual input and generated motions. R-Precision measures the top-1/2/3 retrieval accuracy, while MMDist computes the distance between matched text and motion pairs. Additionally, we validate our motion tokenizer by conducting experiments on the motion reconstruction task. This is measured using both Mean Per Joint Position Error (MPJPE) and FID. MPJPE quantifies the average distance (in millimeters) between the predicted joint positions and the ground truth positions across all joints in the skeleton.
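
For reference, MPJPE as described above reduces to an average of per-joint Euclidean distances. The Python sketch below is a minimal version under the assumption that predictions and ground truth are already aligned and given as lists of frames of (x, y, z) joint positions; the function name and toy coordinates are illustrative.

```python
import math

def mpjpe(pred, target):
    """Mean Euclidean distance between predicted and ground-truth joints.

    pred and target are lists of frames; each frame is a list of (x, y, z)
    joint positions. The result is in the same unit as the inputs.
    """
    dists = [
        math.dist(p, t)
        for pred_frame, target_frame in zip(pred, target)
        for p, t in zip(pred_frame, target_frame)
    ]
    return sum(dists) / len(dists)

# Two frames with two joints each, coordinates in millimeters (toy values).
target = [[(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)], [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]]
pred = [[(3.0, 4.0, 0.0), (100.0, 0.0, 0.0)], [(0.0, 0.0, 0.0), (100.0, 0.0, 12.0)]]
print(mpjpe(pred, target))   # (5 + 0 + 0 + 12) / 4 = 4.25
```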

Implementation Details. For the motion tokenizer, we implement a VQ codebook $\mathbbm{C}\in\mathbbm{R}^{1024\times 512}$ with an embedding dimensionality of $d=512$, and the resulting discrete codes are incorporated as additional vocabulary for the LLM. In comparison, our lookup-free codebook has a size of $2^{14}=16384$, where the least frequently used tokens from the LLM’s vocabulary are mapped to represent motion codes. The motion encoder $\mathbbm{E}$ operates with a temporal downsampling rate of $\alpha=4$. We experiment with four LLM architectures to build our large motion model: GPT2-medium (Radford et al., 2019), Llama-2-7b, Llama-2-13b (Touvron et al., 2023b), and Llama3.1-8b (Dubey et al., 2024). The motion tokenizer is trained with a learning rate of 1e-4 and a batch size of 256 over 300K iterations. For training the large motion model, full parameter tuning is performed on 8×A800 GPUs, with a batch size of 1024, over 300 epochs. The learning rate is set to 2e-4 for GPT2-medium and 2e-5 for the Llama models. Further details are provided in the appendix due to space limitations.

Table 2: Comparisons under different model and data sizes. All experiments are conducted using the same pretrained VQ model for consistency. Additionally, we re-train the motion autoencoder and text encoder (Guo et al., 2022a) separately on the Motion-X and MotionBase datasets, so that each dataset is evaluated with an autoencoder trained on its own data.
| Decoder | #Inst. | #Param. | Motion-X R@1 ↑ | Motion-X R@3 ↑ | Motion-X FID ↓ | MotionBase R@1 ↑ | MotionBase R@3 ↑ | MotionBase FID ↓ |
|---|---|---|---|---|---|---|---|---|
| GPT-2 | 0.02M | 700M | 0.206 | 0.402 | 54.017 | 0.046 | 0.136 | 173.275 |
| GPT-2 | 0.08M | 700M | 0.468 | 0.791 | 0.096 | 0.090 | 0.215 | 251.358 |
| GPT-2 | 0.5M | 700M | 0.358 | 0.618 | 4.852 | 0.116 | 0.276 | 157.950 |
| GPT-2 | 1M | 700M | 0.357 | 0.614 | 5.083 | 0.118 | 0.269 | 121.917 |
| LLaMA-2 | 0.02M | 7B | 0.207 | 0.405 | 53.354 | 0.042 | 0.123 | 160.845 |
| LLaMA-2 | 0.08M | 7B | 0.471 | 0.794 | 0.159 | 0.093 | 0.222 | 253.289 |
| LLaMA-2 | 0.5M | 7B | 0.372 | 0.627 | 4.908 | 0.125 | 0.272 | 87.288 |
| LLaMA-2 | 1M | 7B | 0.351 | 0.602 | 5.582 | 0.125 | 0.267 | 83.024 |
| LLaMA-3 | 0.02M | 8B | 0.217 | 0.418 | 54.004 | 0.043 | 0.124 | 162.102 |
| LLaMA-3 | 0.08M | 8B | 0.483 | 0.802 | 0.103 | 0.082 | 0.214 | 249.790 |
| LLaMA-3 | 0.5M | 8B | 0.363 | 0.625 | 4.798 | 0.121 | 0.264 | 81.389 |
| LLaMA-3 | 1M | 8B | 0.354 | 0.611 | 5.100 | 0.129 | 0.270 | 68.083 |
| LLaMA-2 | 0.02M | 13B | 0.225 | 0.436 | 53.447 | 0.045 | 0.125 | 159.368 |
| LLaMA-2 | 0.08M | 13B | 0.486 | 0.805 | 0.132 | 0.086 | 0.218 | 249.868 |
| LLaMA-2 | 0.5M | 13B | 0.375 | 0.636 | 4.792 | 0.116 | 0.267 | 80.473 |
| LLaMA-2 | 1M | 13B | 0.359 | 0.612 | 5.370 | 0.131 | 0.277 | 78.665 |

5.2 Discussion of Scaling Up Motion Generation

In this section, we investigate the impact of model size and data scale on motion generation performance. We use the motion autoencoder (Guo et al., 2022a) retrained on the Motion-X and MotionBase datasets to evaluate performance on their respective test sets. We categorize our training data into four scales: 0.02M (HumanML3D only), 0.08M (Motion-X only), 0.5M (MotionBase-0.5), and 1M (MotionBase-1.0). For a fair comparison, all experiments use the same VQ motion tokenizer.

Table 3: Comparison with existing SoTA methods on the HumanML3D benchmark. Results marked with * represent values reproduced using the officially released code, while unmarked results are taken from the original papers.

| Method | Decoder | R@1 ↑ | R@3 ↑ | FID ↓ | MMDist ↓ |
|---|---|---|---|---|---|
| MLD | - | 0.481 | 0.772 | 0.473 | 3.196 |
| MotionDiffuse | - | 0.491 | 0.782 | 0.630 | 3.113 |
| T2M-GPT | GPT-2 | 0.492 | 0.775 | 0.141 | 3.121 |
| MotionGPT$^{1,*}$ | T5 | 0.409 | 0.667 | 0.162 | 3.992 |
| MotionGPT$^{1}$ | T5 | 0.492 | 0.778 | 0.232 | 3.096 |
| MotionGPT$^{2,*}$ | Llama-2-13B | 0.367 | 0.654 | 0.571 | 3.981 |
| MotionGPT$^{2,*}$ | Llama-1-13B | 0.363 | 0.633 | 0.592 | 4.029 |
| MotionGPT$^{2}$ | Llama-1-13B | 0.411 | 0.696 | 0.542 | 3.584 |
| MotionLLM | Gemma-2b | 0.482 | 0.770 | 0.491 | 3.138 |
| AvatarGPT | Llama-1-13B | 0.389 | 0.623 | 0.567 | - |
| Ours | Llama-2-13B | 0.519 | 0.803 | 0.166 | 2.964 |

Does increasing model size benefit motion generation? Yes. As shown in Table 2, increasing model size leads to significant performance improvements for the same amount of training data. Specifically, Llama2-13b outperforms Llama2-7b, which in turn surpasses GPT2-medium, illustrating a clear trend of performance gains as model capacity increases. This suggests that larger models are better equipped to capture diverse, complex patterns and relationships within human motions.

Does increasing data scale benefit motion generation? Yes. In Table 2, for the same foundation model, increasing the scale of training data leads to substantial performance gains on the MotionBase test set, in line with the scaling behavior we expected. This improvement is particularly pronounced in the R-Precision metric, emphasizing the critical role of data scale in enhancing semantic alignment between generated motions and text prompts. However, contrary to our expectations, we observe a noticeable performance decline on the Motion-X test set when the model is not trained on Motion-X (0.08M). We attribute this to the limitations of the retrieval-based evaluation model, as discussed in Section 5.4.

Figure 3: Training curves with Y-axis denoting R@1 retrieval accuracy. All these models are trained for 300 epochs at most and are evaluated every 1000 steps.

Is the large motion model competitive with SoTA methods? We evaluate our large motion model on the widely adopted HumanML3D benchmark and compare it against a variety of SoTA approaches. These include diffusion-based methods such as MLD (Chen et al., 2023) and MotionDiffuse (Zhang et al., 2022), as well as the GPT-based T2M-GPT (Zhang et al., 2023a). We also compare against LLM fine-tuning methods such as MotionGPT (Jiang et al., 2023; Zhang et al., 2024b), MotionLLM (Wu et al., 2024), and AvatarGPT (Zhou et al., 2024). As shown in Table 3, our model, which uses Llama-2-13B as the decoder and calculates the loss over the entire concatenated sequence of input text, achieves SoTA performance. Our large motion model significantly outperforms other LLM-based methods such as MotionGPT and AvatarGPT, as well as the earlier T2M-GPT. In particular, we observe substantial improvements in key metrics such as R@1, R@3, and MMDist, highlighting our model’s ability to generate motion sequences that are better aligned with text descriptions and of higher quality.

Slow convergence of large motion models. To evaluate the convergence speed of large motion models, we train GPT-2, Llama2-7b, and Llama3.1-8b for 300 epochs on Motion-X. The R@1 training curves are shown in Figure 3. We observe that all large motion models nearly converge by 200 epochs, with larger models converging faster. Initializing these models with pre-trained weights proves beneficial for speeding up convergence. Compared to large multimodal models like LLaVA (Liu et al., 2023), large motion models require more epochs to capture the complex representations of motion sequences. We attribute the slow convergence of these models to the limited representation capacity of the motion tokenizer, which contains only 512 motion tokens. This suggests the need to optimize the motion tokenizer and expand its representation space. To address this, we explore the 2D-LFQ quantization method as a promising alternative.

Table 4: Ablation of the effectiveness of synthetic data and static data.

| Train set | R@1 ↑ | R@3 ↑ | FID ↓ |
|---|---|---|---|
| w/o static & syn | 0.101 | 0.231 | 261.325 |
| w/o static | 0.110 | 0.248 | 286.809 |
| MotionBase | 0.118 | 0.269 | 121.917 |

Do static and synthetic data help? Yes. The addition of static image data and of synthesized data both contribute to improvements, as illustrated in Table 4. More analysis can be found in Appendix C.1.

Table 5: Ablation of out-of-domain evaluation on the UNSEEN-90K dataset, where $\#N$ denotes that we use $N$ subsets of MotionBase for training.

| Train set | R@1 ↑ | R@3 ↑ | FID ↓ |
|---|---|---|---|
| HumanML3D | 0.0264 | 0.0832 | 257.563 |
| MotionX | 0.0224 | 0.0705 | 246.220 |
| MotionBase-#38 | 0.0761 | 0.2090 | 263.539 |

Do large motion models perform better in the out-of-distribution setup? Yes. We present the results in Table 5. This ablation is essential for further validating the generalization capabilities of large motion models, as the improvements observed in Table 2 may stem from the inclusion of additional in-domain data from Motion-X. In this setup, we select four subsets from MotionBase, comprising 90K samples (UNSEEN-90K), for evaluation, while the remaining 38 subsets are used for training. This ensures that the test set consists entirely of out-of-domain (OOD) samples. We compare the performance of models trained on HumanML3D, MotionX, and MotionBase-#38, where $\#N$ denotes the number of training subsets; all models use the GPT2-medium architecture. The results on the OOD test set clearly demonstrate that the model trained on MotionBase significantly outperforms those trained on HumanML3D and MotionX, particularly in terms of the R@1 and R@3 metrics. These findings highlight the superior generalization ability of large motion models when handling unseen OOD data, especially when trained on diverse, large-scale datasets. However, we once again observe unexpected results with the FID metric, which are discussed further in Section 5.4.
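The key property of this setup is that the split is made at the level of whole subsets rather than individual samples. A minimal sketch of such a split is given below; the `(subset_name, item)` sample layout and the subset names in the usage example are hypothetical and do not reflect the actual UNSEEN-90K subsets.

```python
def split_by_subset(samples, held_out):
    """Split samples so that held-out subsets never appear in training.

    samples: iterable of (subset_name, item) pairs (hypothetical layout).
    held_out: set of subset names reserved for out-of-domain evaluation.
    """
    train, test = [], []
    for subset_name, item in samples:
        # Route each sample by the subset it came from, not by random draw.
        (test if subset_name in held_out else train).append((subset_name, item))
    return train, test


# Usage with made-up subset names:
data = [("dance", 1), ("fitness", 2), ("dance", 3), ("kungfu", 4)]
train_set, test_set = split_by_subset(data, held_out={"dance"})
```

Because no subset contributes to both sides, any gain on the test side cannot be explained by in-domain samples leaking into training.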

Figure 4: Comparison of different motion quantization methods on Motion-X (left) and MotionBase (right). Note that we only show MPJPE (↓) results here; FID results are shown in Appendix C.5.

5.3 Discussion of Motion Quantization

In this section, we investigate the impact of different motion quantization methods. We compare our proposed 2D lookup-free quantization (2D-LFQ) against two commonly used approaches, residual vector quantization (RVQ) and vector quantization (VQ), across codebook sizes ranging from $2^{8}$ to $2^{16}$. RVQ/VQ and 2D-LFQ have 19.43M and 108.35M parameters, respectively. As shown in Figure 4, 2D-LFQ demonstrates significant improvements over both RVQ and VQ. Notably, as the codebook size increases, 2D-LFQ continues to improve, while RVQ and VQ experience diminishing returns or performance degradation with larger codebooks. Our deeper analysis attributes these gains to better codebook utilization by 2D-LFQ. Figure 5 illustrates that the utilization rates for VQ and RVQ begin to decline once the codebook size exceeds $2^{10}$, which corresponds to the peak performance for these methods, whereas the utilization of 2D-LFQ continues to increase with larger codebooks. Additionally, we conduct further experiments to validate the benefits of 2D motion encoding in Appendix C.5.
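To illustrate why a lookup-free codebook can grow without an explicit embedding table, the sketch below follows the sign-based formulation of lookup-free quantization described by Yu et al. (2023). It is a simplified, assumption-laden illustration: it quantizes a single latent vector, does not reproduce the 2D encoding of 2D-LFQ, and the function names are hypothetical.

```python
def lfq_quantize(latent):
    """Quantize each latent dimension to -1 or +1 by its sign.

    Returns (codes, index), where index is the integer whose binary
    digits are the per-dimension sign bits. A latent with n dimensions
    therefore addresses 2**n codes without any nearest-neighbor lookup.
    """
    bits = [1 if v > 0 else 0 for v in latent]
    codes = [1.0 if b else -1.0 for b in bits]
    index = sum(b << i for i, b in enumerate(bits))
    return codes, index


def codebook_utilization(indices, codebook_size):
    """Fraction of codebook entries used at least once."""
    return len(set(indices)) / codebook_size


# Usage: a 4-dimensional latent addresses a codebook of 2**4 = 16 entries.
codes, idx = lfq_quantize([0.3, -1.2, 0.0, 2.0])
usage = codebook_utilization([idx, 0, 3, 3], codebook_size=16)
```

In this formulation the codebook size is set by the number of latent dimensions rather than by a learned table, which is one plausible reason utilization does not collapse as the codebook grows.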

Figure 5: Comparison of codebook utilization for different motion quantization methods.

5.4 Limitation of Automated Metric

As mentioned earlier, the FID scores in Table 2 and Table 5 yield unexpected results. Specifically, when evaluating on Motion-X and UNSEEN-90K, FID is best for the model trained on Motion-X, which significantly outperforms models trained on both the smaller HumanML3D and the larger-scale MotionBase. In this section, we investigate this anomaly. FID, a standard metric widely used for generation tasks, is typically measured by a pretrained evaluator. In traditional image generation, FID is calculated using a well-trained, robust visual encoder like InceptionNet (Szegedy et al., 2015), which is trained on millions of images. However, the evaluator currently used to compute FID for motion generation is a simple motion autoencoder with a very small parameter scale (Guo et al., 2022a). Since this motion autoencoder is trained on limited data consisting of only 20K motions, we argue that it may lack the generalization needed for robust performance, leading to difficulties in reliably capturing the complex semantic alignment between text and motion.

Similar unexpected results occur in motion reconstruction as well. As shown in Table 6, the FID score on HumanML3D is more than an order of magnitude higher for 2D-LFQ than for VQ-VAE (1.769 vs. 0.078), despite the former achieving a much lower MPJPE. When tested on MotionBase, 2D-LFQ obtains the highest FID score even while achieving the best MPJPE. We observe the same issue with other metrics like MMDist, as discussed in Appendix C.1. Notably, Voas et al. (2023) have mentioned that existing metrics are sensitive to the quality of the embedding space and do not always align with human perception. These findings highlight the need for a more robust and fair evaluation metric for large motion models moving forward.
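For reference, FID is conventionally computed as the Fréchet distance between two Gaussians fitted to evaluator features of real and generated samples. The symbols below are introduced here for illustration and do not appear elsewhere in the paper: $\mu_r, \Sigma_r$ denote the mean and covariance of features extracted from real motions, and $\mu_g, \Sigma_g$ those of generated motions.

```latex
\[
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
\]
```

Every term is estimated from the evaluator's feature space, so an evaluator that generalizes poorly to unseen motions can distort the score regardless of the actual motion quality.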

Table 6: Robustness investigation of the evaluation metrics on the motion reconstruction task.
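
| Tokenizer | #Num. | #Param. | HumanML3D FID ↓ | HumanML3D MPJPE ↓ | Motion-X FID ↓ | Motion-X MPJPE ↓ | MotionBase FID ↓ | MotionBase MPJPE ↓ |
|---|---|---|---|---|---|---|---|---|
| VQ-VAE | 512 | 19.43M | 0.078 | 69.2 | 0.852 | 106.4 | 4.366 | 123.6 |
| RQ-VAE | 512 | 19.43M | 0.05 | 37.5 | 0.568 | 56.9 | 4.026 | 78.2 |
| 2D-LFQ | 16384 | 108.35M | 1.769 | 45.6 | 0.295 | 54.1 | 7.853 | 64.1 |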

6 Conclusion

In this paper, we explore how to advance the field of large-scale motion generation. To this end, we introduce a large-scale motion dataset named MotionBase, which includes detailed text descriptions and rich modality annotations, providing a strong foundation for effectively training large motion models. Our research highlights key findings, such as the impact of scaling both data and model size. Additionally, we identify potential limitations in the current evaluation metrics, particularly when assessing diverse and unseen motions. To enhance the benefits large motion models can derive from extensive motion data, we propose a novel motion quantization approach that treats motion clips as 2D images and constructs a finite-scale codebook, eliminating the need for token lookups. We hope that this research offers valuable direction for future work in large-scale motion generation.

References

  • Ahn et al. (2018) Hyemin Ahn, Timothy Ha, Yunho Choi, Hwiyeon Yoo, and Songhwai Oh. Text2action: Generative adversarial synthesis from language to action. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp.  5915–5920. IEEE, 2018.
  • Ahuja & Morency (2019) Chaitanya Ahuja and Louis-Philippe Morency. Language2pose: Natural language grounded pose forecasting. In 2019 International Conference on 3D Vision (3DV), pp.  719–728. IEEE, 2019.
  • Aliakbarian et al. (2020) Sadegh Aliakbarian, Fatemeh Sadat Saleh, Mathieu Salzmann, Lars Petersson, and Stephen Gould. A stochastic conditioning scheme for diverse human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  5223–5232, 2020.
  • Bain et al. (2021) Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  1728–1738, 2021.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Cai et al. (2018) Haoye Cai, Chunyan Bai, Yu-Wing Tai, and Chi-Keung Tang. Deep video generation, prediction and completion of human action sequences. In Proceedings of the European conference on computer vision (ECCV), pp.  366–382, 2018.
  • Cervantes et al. (2022) Pablo Cervantes, Yusuke Sekikawa, Ikuro Sato, and Koichi Shinoda. Implicit neural representations for variable length human motion generation. In European Conference on Computer Vision, pp.  356–372. Springer, 2022.
  • Chen et al. (2023) Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  18000–18010, 2023.
  • Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  • Chung et al. (2021) Jihoon Chung, Cheng-hsin Wuu, Hsuan-ru Yang, Yu-Wing Tai, and Chi-Keung Tang. Haa500: Human-centric atomic action dataset with curated videos. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  13465–13474, 2021.
  • Dai et al. (2023) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
  • Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  • Fragkiadaki et al. (2015) Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. Recurrent network models for human dynamics. In Proceedings of the IEEE international conference on computer vision, pp.  4346–4354, 2015.
  • Ghosh et al. (2017) Partha Ghosh, Jie Song, Emre Aksan, and Otmar Hilliges. Learning human motion models for long-term predictions. In 2017 International Conference on 3D Vision (3DV), pp.  458–466. IEEE, 2017.
  • Gopalakrishnan et al. (2019) Anand Gopalakrishnan, Ankur Mali, Dan Kifer, Lee Giles, and Alexander G Ororbia. A neural temporal model for human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  12116–12125, 2019.
  • Guo et al. (2020) Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Action2motion: Conditioned generation of 3d human motions. In Proceedings of the 28th ACM International Conference on Multimedia, pp.  2021–2029, 2020.
  • Guo et al. (2022a) Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  5152–5161, 2022a.
  • Guo et al. (2022b) Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts. In European Conference on Computer Vision, pp.  580–597. Springer, 2022b.
  • Guo et al. (2024) Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked modeling of 3d human motions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  1900–1910, 2024.
  • Gupta et al. (2022) Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, and Li Fei-Fei. Maskvit: Masked visual pre-training for video prediction. arXiv preprint arXiv:2206.11894, 2022.
  • Jiang et al. (2023) Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign language. Advances in Neural Information Processing Systems, 36:20067–20079, 2023.
  • Lee et al. (2022) Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  11523–11532, 2022.
  • Li et al. (2023a) Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023a.
  • Li et al. (2023b) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023b.
  • Lin et al. (2023) Jing Lin, Ailing Zeng, Haoqian Wang, Lei Zhang, and Yu Li. One-stage 3d whole-body mesh recovery with component aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  21159–21168, 2023.
  • Lin et al. (2024) Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems, 36, 2024.
  • Liu et al. (2023) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
  • Liu et al. (2022) Zhenguang Liu, Shuang Wu, Shuyuan Jin, Shouling Ji, Qi Liu, Shijian Lu, and Li Cheng. Investigating pose representations and motion contexts modeling for 3d motion prediction. IEEE transactions on pattern analysis and machine intelligence, 45(1):681–697, 2022.
  • Mahmood et al. (2019) Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  5442–5451, 2019.
  • Mao et al. (2019) Wei Mao, Miaomiao Liu, Mathieu Salzmann, and Hongdong Li. Learning trajectory dependencies for human motion prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  9489–9497, 2019.
  • Mehta et al. (2017) Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 2017 international conference on 3D vision (3DV), pp.  506–516. IEEE, 2017.
  • Mentzer et al. (2023) Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple. arXiv preprint arXiv:2309.15505, 2023.
  • OpenAI (2024) OpenAI. GPT-4o mini: advancing cost-efficient intelligence. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/, 2024.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
  • Pavlakos et al. (2019) Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  10975–10985, 2019.
  • Petrovich et al. (2022) Mathis Petrovich, Michael J Black, and Gül Varol. Temos: Generating diverse human motions from textual descriptions. In European Conference on Computer Vision, pp.  480–497. Springer, 2022.
  • Plappert et al. (2016) Matthias Plappert, Christian Mandery, and Tamim Asfour. The kit motion-language dataset. Big data, 4(4):236–252, 2016.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.  8748–8763. PMLR, 2021.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  • Reid et al. (2024) Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
  • Sárándi et al. (2023) István Sárándi, Alexander Hermans, and Bastian Leibe. Learning 3d human pose estimation from dozens of datasets using a geometry-aware autoencoder to bridge between skeleton formats. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.  2956–2966, 2023.
  • Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  1–9, 2015.
  • Taheri et al. (2020) Omid Taheri, Nima Ghorbani, Michael J Black, and Dimitrios Tzionas. Grab: A dataset of whole-body human grasping of objects. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pp.  581–600. Springer, 2020.
  • Tevet et al. (2022) Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, and Daniel Cohen-Or. Motionclip: Exposing human motion generation to clip space. In European Conference on Computer Vision, pp.  358–374. Springer, 2022.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  • Van Den Oord et al. (2017) Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
  • Voas et al. (2023) Jordan Voas, Yili Wang, Qixing Huang, and Raymond Mooney. What is the best automated metric for text to motion generation? In SIGGRAPH Asia 2023 Conference Papers, pp.  1–11, 2023.
  • Wang et al. (2023) Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023.
  • Wang et al. (2020) Zhenyi Wang, Ping Yu, Yang Zhao, Ruiyi Zhang, Yufan Zhou, Junsong Yuan, and Changyou Chen. Learning diverse stochastic human-action generators by learning smooth latent transitions. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.  12281–12288, 2020.
  • Wu et al. (2024) Qi Wu, Yubo Zhao, Yifan Wang, Yu-Wing Tai, and Chi-Keung Tang. Motionllm: Multimodal motion-language learning with large language models. arXiv preprint arXiv:2405.17013, 2024.
  • Xu et al. (2024) Boshen Xu, Ziheng Wang, Yang Du, Sipeng Zheng, Zhinan Song, and Qin Jin. Egonce++: Do egocentric video-language models really understand hand-object interactions? arXiv preprint arXiv:2405.17719, 2024.
  • Xu et al. (2022) Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose: Simple vision transformer baselines for human pose estimation. Advances in Neural Information Processing Systems, 35:38571–38584, 2022.
  • Yan et al. (2021) Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.
  • Ye et al. (2023) Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
  • You et al. (2022) Tackgeun You, Saehoon Kim, Chiheon Kim, Doyup Lee, and Bohyung Han. Locally hierarchical auto-regressive modeling for image generation. Advances in Neural Information Processing Systems, 35:16360–16372, 2022.
  • Yu et al. (2023) Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023.
  • Yuan et al. (2022) Ye Yuan, Umar Iqbal, Pavlo Molchanov, Kris Kitani, and Jan Kautz. Glamr: Global occlusion-aware human mesh recovery with dynamic cameras. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  11038–11049, 2022.
  • Zhang et al. (2023a) Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Yong Zhang, Hongwei Zhao, Hongtao Lu, Xi Shen, and Ying Shan. Generating human motion from textual descriptions with discrete representations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  14730–14740, 2023a.
  • Zhang et al. (2022) Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. Motiondiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001, 2022.
  • Zhang et al. (2024a) Siwei Zhang, Bharat Lal Bhatnagar, Yuanlu Xu, Alexander Winkler, Petr Kadlecek, Siyu Tang, and Federica Bogo. Rohm: Robust human motion reconstruction via diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  14606–14617, 2024a.
  • Zhang et al. (2023b) Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023b.
  • Zhang et al. (2024b) Yaqi Zhang, Di Huang, Bin Liu, Shixiang Tang, Yan Lu, Lu Chen, Lei Bai, Qi Chu, Nenghai Yu, and Wanli Ouyang. Motiongpt: Finetuned llms are general-purpose motion generators. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.  7368–7376, 2024b.
  • Zhao et al. (2023) Bo Zhao, Boya Wu, and Tiejun Huang. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087, 2023.
  • Zheng et al. (2023) Sipeng Zheng, Yicheng Feng, Zongqing Lu, et al. Steve-eye: Equipping llm-based embodied agents with visual perception in open worlds. In The Twelfth International Conference on Learning Representations, 2023.
  • Zheng et al. (2024) Sipeng Zheng, Bohan Zhou, Yicheng Feng, Ye Wang, and Zongqing Lu. Unicode: Learning a unified codebook for multimodal large language models. arXiv preprint arXiv:2403.09072, 2024.
  • Zhou et al. (2024) Zixiang Zhou, Yu Wan, and Baoyuan Wang. Avatargpt: All-in-one framework for motion understanding planning generation and beyond. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  1357–1366, 2024.

Appendix A Additional Details of MotionBase

In this section, we provide more details about MotionBase that are not included in the main paper due to space limitations.

A.1 Statistic Analyses

MotionBase contains over 1 million motion sequences from 42 different public datasets and web videos on the Internet. Subsets of MotionX, including Animation, Perform, Dance, Aist, Kungfu, GRAB (Taheri et al., 2020), Music, Idea400 (Lin et al., 2024), HAA500 (Chung et al., 2021), Game Motion, and Fitness, are included in MotionBase. Recognizing the high cost of collecting and annotating videos, we also see the untapped potential of images for motion understanding. Consequently, MotionBase incorporates image data by repeating each image across 64 frames and treating it as a motion sequence. For the datasets with long-range videos, such as MPI-INF-3DHP (Mehta et al., 2017), we segment the footage into sub-clips with random durations ranging from 10 seconds to one minute. Figure 6 and Figure 7 illustrate the scale and length distributions of MotionBase.
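The two preprocessing steps described above, turning a static image into a constant sequence and cutting long videos into sub-clips of random duration, can be sketched as follows. This is an illustrative sketch only: the function names, the pose representation, the fixed random seed, and the choice to drop a trailing remainder shorter than the minimum duration are all assumptions.

```python
import random


def image_to_sequence(pose, num_frames=64):
    """Repeat a single static pose to form a constant motion sequence."""
    return [list(pose) for _ in range(num_frames)]


def segment_video(num_frames, fps, min_sec=10, max_sec=60, seed=0):
    """Cut a long video into consecutive sub-clips of random duration.

    Returns a list of (start, end) frame ranges with end exclusive.
    Each clip lasts between min_sec and max_sec seconds; a trailing
    remainder shorter than min_sec is dropped (an assumption).
    """
    rng = random.Random(seed)
    min_len, max_len = min_sec * fps, max_sec * fps
    clips, start = [], 0
    while num_frames - start >= min_len:
        length = rng.randint(min_len, max_len)  # inclusive on both ends
        end = min(start + length, num_frames)
        clips.append((start, end))
        start = end
    return clips


# Usage: a five-minute video at 30 fps and a single 3-value pose vector.
clips = segment_video(num_frames=30 * 300, fps=30)
static_seq = image_to_sequence([0.0, 1.0, 2.0])
```

Each resulting clip can then be captioned and tokenized in the same way as any other motion sequence.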

Figure 6: The scale distribution of motion sequences across subsets of MotionBase.
Figure 7: The length distribution across different subsets of MotionBase.
Figure 8: Prompt examples to label human motions in the video. We use Gemini-1.5-pro and GPT-4o-mini to generate motion descriptions for the video and image data, respectively. We provide “whole-body” (UP) and “part-level” (DOWN) labels for each sample in the dataset.

A.2 Prompt of Motion Description

In this paper, we use Gemini-1.5-pro (Reid et al., 2024) and GPT-4o-mini (OpenAI, 2024) as large multimodal models (LMM) to generate textual annotations for video and image data, respectively. For each person-centric sample, we first crop and track the person’s body using the corresponding bounding box(es). The LMM is then tasked with focusing on the person’s physical movements and positions in the global space to generate detailed descriptions. Unlike previous datasets, we provide more granular motion descriptions by dividing the body into upper and lower sections, prompting the LMM to generate part-specific descriptions (“part-level”). Additionally, an overall summary of the entire body’s movement (“whole-body”) is also produced. Figure 8 illustrates the prompt used to caption human motion sequences in MotionBase.

A.3 Word Distribution Analysis

To further explore the annotated motion text, we generate word clouds from the entire text corpus in MotionBase. Since the annotations in MotionBase consist of both whole-body and part-level descriptions, we create separate word clouds for general labels and more detailed annotations, as shown in Figure 9 and Figure 10, respectively. In Figure 9, we observe that the whole-body annotations primarily highlight high-level motion activities, such as standing, sitting, and walking. In contrast, Figure 10 shows that part-level annotations focus more on specific body movements, including the torso, shoulders, legs, and arms. We believe that this hierarchical structure of annotations will enhance the understanding of motion.

Figure 9: Word cloud of whole-body textual annotation in MotionBase.
Figure 10: Word cloud of part-level textual annotation in MotionBase.

Appendix B Additional Overview of Model Architecture

Due to space limitations in the main paper, we provide the overview of our model architecture in Figure 11 in this appendix. Following most LMMs, our large motion model consists of two stages: pre-training and fine-tuning. During the pre-training stage, we train a motion encoder, a motion decoder, and a motion codebook to represent motions using discrete tokens. With this motion tokenizer, we fine-tune an autoregressive language model to predict motion tokens. In the inference stage, the input text is processed by the language model to generate motion tokens in an autoregressive manner, which are then decoded into natural motion by the pre-trained motion decoder.

Figure 11: Overview of the large motion model, which can be divided into two stages. In the first stage (left), we pre-train a motion VQ-VAE to quantize motion sequences into tokens. In the second stage (right), we fine-tune an autoregressive language model to predict motion tokens.
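The two-stage design can be illustrated with a toy example. The sketch below assumes a plain nearest-neighbour vector quantizer and uses stand-in callables for the language model and the motion decoder; the function names `quantize`, `dequantize`, and `generate_motion` are hypothetical, and no training is shown.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry (squared L2)."""
    tokens = []
    for z in latents:
        dists = [sum((zi - ci) ** 2 for zi, ci in zip(z, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def dequantize(tokens, codebook):
    """Look up the codebook vector for every motion token."""
    return [codebook[t] for t in tokens]


def generate_motion(text, language_model, codebook, decoder):
    """Inference: text -> motion tokens (language model) -> motion (decoder)."""
    tokens = language_model(text)
    return decoder(dequantize(tokens, codebook))


# Toy stand-ins: a 3-entry codebook, a fixed "language model", an identity decoder.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
tokens = quantize([[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]], codebook)
motion = generate_motion("a person walks", lambda text: [2, 0, 1], codebook, lambda z: z)
```

The first function corresponds to the tokenizer learned in the first stage, and `generate_motion` corresponds to the inference path in which the fine-tuned language model emits tokens that the pre-trained decoder turns back into motion.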

Appendix C Additional Experimental Results

In this section, we provide additional experimental analysis that could not be included in the main paper due to space limitations.

Table 7: Ablation of the effectiveness of synthetic data and static data.
| Train set | R@1 ↑ | R@3 ↑ | FID ↓ | MMDist ↓ |
|---|---|---|---|---|
| w/o static & syn | 0.101 | 0.231 | 261.325 | 5.201 |
| w/o static | 0.110 | 0.248 | 286.809 | 5.213 |
| MotionBase | 0.118 | 0.269 | 121.917 | 7.644 |

C.1 Ablation of Synthetic and Static Data

To assess the effectiveness of synthetic and static data, we conduct a series of ablation experiments. We train GPT2-medium on three variations of MotionBase: without synthetic data, without image data, and without both synthetic data and image data. The model is trained for 300 epochs with a learning rate of 2e-4. Performance is tested on the Motion-X test set using the VQ-VAE and retrieval model trained on MotionBase, with results shown in Table 7. Our findings indicate that incorporating both static data (i.e., image data) and synthetic data leads to performance improvements in terms of R-Precision. Additionally, we observe that the trend of MMDist is opposite to that of R-Precision. This could be attributed to MMDist’s sensitivity to the quality of the embedding space. When the motion and text encoders have limited capacity, this metric may struggle to discern the quality of generated motions. This phenomenon highlights the importance of developing more robust evaluation metrics and models.
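Because both R-Precision and MMDist are computed in the joint embedding space of the retrieval model, it helps to see how each depends on the embeddings. The following is a minimal sketch under stated assumptions: embeddings are plain lists of floats, the paired text for motion i sits at index i, and the whole batch serves as the candidate pool (the commonly used protocol ranks against a fixed-size pool of mismatched descriptions instead). The function names are illustrative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def r_precision(motion_embs, text_embs, k):
    """Fraction of motions whose paired text (same index) is among the k nearest texts."""
    hits = 0
    for i, m in enumerate(motion_embs):
        dists = [euclidean(m, t) for t in text_embs]
        ranked = sorted(range(len(text_embs)), key=lambda j: dists[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(motion_embs)


def mm_dist(motion_embs, text_embs):
    """Mean distance between each motion embedding and its paired text embedding."""
    total = sum(euclidean(m, t) for m, t in zip(motion_embs, text_embs))
    return total / len(motion_embs)


# Toy embeddings: motion 1 is closer to text 0 than to its own text 1.
motions = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0]]
texts = [[0.1, 0.0], [0.2, 0.0], [5.0, 5.0]]
r_at_1 = r_precision(motions, texts, k=1)
r_at_2 = r_precision(motions, texts, k=2)
```

R-Precision depends only on the ranking of the paired text, whereas MMDist depends on absolute distances, so the two metrics can move in different directions when the scale of the embedding space changes.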

Table 8: Comparison of evaluations using different encoder models.
| Decoder | #Inst. | #Param. | EM_HumanML3D R@1 ↑ | EM_HumanML3D R@3 ↑ | EM_HumanML3D FID ↓ | EM_Motion-X R@1 ↑ | EM_Motion-X R@3 ↑ | EM_Motion-X FID ↓ |
|---|---|---|---|---|---|---|---|---|
| GPT-2 | 0.02M | 700M | 0.466 | 0.752 | 0.101 | 0.358 | 0.651 | 0.050 |
| GPT-2 | 0.08M | 700M | 0.462 | 0.744 | 0.208 | 0.362 | 0.656 | 0.754 |
| LLaMA-2 | 0.02M | 7B | 0.497 | 0.778 | 0.214 | 0.378 | 0.671 | 0.122 |
| LLaMA-2 | 0.08M | 7B | 0.474 | 0.758 | 0.452 | 0.376 | 0.673 | 0.518 |
| LLaMA-3 | 0.02M | 8B | 0.500 | 0.783 | 0.173 | 0.380 | 0.675 | 0.094 |
| LLaMA-3 | 0.08M | 8B | 0.499 | 0.786 | 0.264 | 0.393 | 0.696 | 0.591 |
| LLaMA-2 | 0.02M | 13B | 0.519 | 0.803 | 0.166 | 0.395 | 0.695 | 0.105 |
| LLaMA-2 | 0.08M | 13B | 0.504 | 0.790 | 0.393 | 0.400 | 0.700 | 0.637 |

C.2 Ablation of Different Encoder Models

Table 8 presents the evaluation results on the HumanML3D test set using different encoder models (EM). We employ the same dual-encoder architecture (Guo et al., 2022a) but train it on two distinct datasets: HumanML3D and Motion-X, where HumanML3D is a subset of Motion-X. The results highlight the limited generalization ability of the encoder model. When using the model trained on the larger Motion-X dataset, performance metrics on HumanML3D decrease. This suggests that training on the broader Motion-X dataset negatively impacts R-Precision performance on the HumanML3D subset. Furthermore, when the encoder model is trained on Motion-X, increasing the training data size for the text-to-motion model leads to significant performance gains. Conversely, when using the encoder model trained on HumanML3D, the performance of the text-to-motion model degrades as the training data size increases. This might be attributed to inherent limitations in the encoder model itself.

Table 9: Comparison between fine-tuning and learning from scratch on the Motion-X test set.
| #Inst. | From Scratch | R@1 ↑ | R@3 ↑ | FID ↓ | MMDist ↓ |
|---|---|---|---|---|---|
| 0.02M | Yes | 0.035 | 0.103 | 16.904 | 9.280 |
| 0.02M | No | 0.206 | 0.402 | 54.017 | 8.218 |
| 0.08M | Yes | 0.460 | 0.782 | 0.113 | 2.862 |
| 0.08M | No | 0.468 | 0.791 | 0.096 | 2.798 |

C.3 Ablation of Learning from Scratch vs. Fine-tuning

We compare the performance of fine-tuning GPT-2 against training it from scratch (random initialization). The results show that fine-tuned models consistently outperform those trained from scratch, particularly when trained on HumanML3D and evaluated on Motion-X. The improvement obtained from the pre-trained LLM highlights the importance of text pre-training in enhancing the model’s understanding of text descriptions and improving its generalization capabilities.

C.4 Ablation of Different Loss Calculation Strategies

Table 10: Results of different loss calculation methods on the HumanML3D test set.
| Loss Calculation | R@1 ↑ | R@3 ↑ | FID ↓ | MMDist ↓ |
|---|---|---|---|---|
| Motion Seq Loss | 0.388 | 0.650 | 0.680 | 3.919 |
| Whole Seq Loss | 0.466 | 0.752 | 0.101 | 3.234 |

We also investigate the impact of different loss calculation strategies on model performance. We compare two strategies: 1) calculating the loss solely on the output motion tokens, and 2) calculating the loss on both the input text and the output motion tokens. As shown in Table 10, our results indicate that the second strategy yields better performance. This improvement over the first strategy is likely due to its ability to prevent catastrophic forgetting of text understanding. Additionally, it helps mitigate overfitting to motion patterns in the training data, thereby enhancing the model’s generalization ability.
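The two strategies differ only in which token positions contribute to the averaged negative log-likelihood. A minimal sketch follows, assuming per-token log-probabilities are already available and that a boolean mask marks the motion tokens; the function name `masked_nll` and the toy numbers are illustrative.

```python
import math


def masked_nll(token_log_probs, is_motion_token, motion_only):
    """Average negative log-likelihood over the positions selected by the strategy.

    motion_only=True  -> loss on motion tokens only ("Motion Seq Loss" in Table 10)
    motion_only=False -> loss on text and motion tokens ("Whole Seq Loss" in Table 10)
    """
    selected = [
        -lp
        for lp, is_motion in zip(token_log_probs, is_motion_token)
        if is_motion or not motion_only
    ]
    return sum(selected) / len(selected)


# Toy sequence: two text tokens followed by two motion tokens.
log_probs = [math.log(0.5), math.log(0.5), math.log(0.25), math.log(0.25)]
mask = [False, False, True, True]

motion_seq_loss = masked_nll(log_probs, mask, motion_only=True)
whole_seq_loss = masked_nll(log_probs, mask, motion_only=False)
```

Under the second strategy the text positions keep contributing gradient signal, which is consistent with the explanation above that it helps preserve text understanding.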

Figure 12: Comparison of different motion quantization methods on the Motion-X (left) and MotionBase (right) datasets. The Y-axis denotes FID (↓).

C.5 Ablation of Motion Quantization

First, we provide additional FID results on Motion-X in Figure 12. It is worth noting that while our motion quantizer performs worse than RQ-VAE on the smaller HumanML3D dataset, it surpasses both VQ and RQ when evaluated on the larger Motion-X and MotionBase benchmarks, as can be seen in Table 6. This suggests that our approach offers a greater advantage when applied to larger datasets, highlighting its improved generalization compared to previous methods.

To further validate the effectiveness of our 2D quantization strategy, we compare the 2D-LFQ method with its 1D counterpart (which is identical to VQ except for the quantization strategy). The results, shown in Table 11, demonstrate that 2D quantization in LFQ significantly outperforms the 1D version. This highlights the superior ability of 2D quantization to enhance the representational capacity of the motion tokenizer.

Table 11: Ablation of 2D motion quantization vs. its 1D version.
| Tokenizer | #Num. | #Param. | HumanML3D FID ↓ | HumanML3D MPJPE ↓ | Motion-X FID ↓ | Motion-X MPJPE ↓ | MotionBase FID ↓ | MotionBase MPJPE ↓ |
|---|---|---|---|---|---|---|---|---|
| 1D-LFQ | 16384 | 19.43M | 3.85 | 52.5 | 2.783 | 78.9 | 10.358 | 80.1 |
| 2D-LFQ | 16384 | 108.35M | 1.769 | 45.6 | 0.295 | 54.1 | 7.853 | 64.1 |
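To make the lookup-free step concrete, the sketch below shows sign-based quantization of a single latent vector, in which the token index is read directly from the signs of the channels and no nearest-neighbour codebook search is needed. It is an illustration only: the function names are hypothetical, the 2D arrangement of the latents is not reproduced, and the choice of 14 channels is an assumption made because 2^14 = 16384 matches the codebook size listed in Table 11.

```python
def lfq_quantize(latent):
    """Binarize each channel by sign and pack the bits into one integer token.

    Returns the token index and the quantized vector with entries in {-1.0, +1.0}.
    """
    bits = [1 if x > 0 else 0 for x in latent]
    token = 0
    for i, b in enumerate(bits):
        token |= b << i  # channel i contributes bit i of the token index
    quantized = [1.0 if b else -1.0 for b in bits]
    return token, quantized


def lfq_codebook_size(num_channels):
    """Number of distinct tokens an LFQ with the given channel count can express."""
    return 2 ** num_channels


token, quantized = lfq_quantize([0.3, -1.2, 2.0])
implicit_size = lfq_codebook_size(14)  # assumed channel count, see the note above
```

Since the codebook is implicit, its capacity grows exponentially with the number of channels without storing or searching any embedding table.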

Appendix D Additional Qualitative Results

We provide some examples to visualize the human motions predicted by our large motion model trained on MotionBase, as illustrated in Figure 13. As can be seen, our large motion model is capable of generating motion sequences that align well with the input texts, demonstrating the effectiveness of the MotionBase dataset.

Figure 13: Qualitative examples of motions generated by our large motion model.