Dynamic facial expression generation from natural language is a crucial task in computer graphics, with applications in animation, virtual avatars, and human-computer interaction. However, current generative models are held back by datasets that are either speech-driven or limited to coarse emotion labels, lack the nuanced, expressive descriptions needed for fine-grained control, and were captured with elaborate and expensive equipment. We therefore present a new dataset of facial motion sequences featuring nuanced performances and semantic annotation. The data is collected easily with commodity equipment and LLM-generated natural language instructions, and is stored in the popular ARKit blendshape format, providing riggable motion rich with expressive performances and labels. We train two baseline models and evaluate their performance for future benchmarking. Using our Express4D dataset, the trained models learn meaningful text-to-expression motion generation and capture the many-to-many mapping between the two modalities.
Express4D is a new benchmark dataset designed to advance the generation of dynamic facial expressions from natural language. Unlike existing datasets that rely on speech or coarse emotion labels, Express4D provides over 1200 richly expressive 3D facial motion sequences, each paired with a fine-grained, free-text prompt. These expressions were intentionally performed by real people in response to natural language instructions and captured in the ARKit blendshape format with consumer-grade iPhones, enabling easy integration into animation pipelines. To demonstrate the dataset's potential, we trained and evaluated two baseline models (diffusion-based and VQ-based) for text-to-expression generation, using metrics adapted from the human motion literature. Results show that Express4D enables plausible, semantically aligned facial motion generation from text, while revealing current challenges in handling complex or multi-stage expressions.
The Express4D dataset was created by generating diverse natural language prompts with ChatGPT, each describing a dynamic facial expression. The described expressions were then performed by 18 participants and recorded with the Live Link Face app on an iPhone with a TrueDepth camera, which captured 3D facial motion at 60 frames per second as 61 ARKit blendshape coefficients per frame. Each sequence is paired with its corresponding prompt and stored as a CSV file, making it lightweight and easy to use in training pipelines.
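As an illustration, the sketch below shows one way such a recording could be loaded for training. The exact column names, the presence of metadata columns, and the convention of storing the prompt in a sibling .txt file are assumptions made for the example rather than the dataset's documented schema.

```python
# Minimal sketch of loading one Express4D-style recording.
# Assumed layout (not the dataset's documented schema): each CSV holds one row
# per frame at 60 fps with the 61 ARKit blendshape coefficients as numeric
# columns, and the matching prompt is stored in a .txt file with the same stem.
from pathlib import Path

import numpy as np
import pandas as pd

FPS = 60  # capture rate of the TrueDepth recordings


def load_sequence(csv_path: str) -> tuple[np.ndarray, str]:
    """Return a (T, 61) blendshape motion array and its text prompt."""
    df = pd.read_csv(csv_path)
    # Keep numeric channels only, dropping any metadata columns such as a
    # timecode; adjust the filter to the actual header names.
    motion = df.select_dtypes(include="number").to_numpy(dtype=np.float32)

    prompt_path = Path(csv_path).with_suffix(".txt")  # hypothetical convention
    prompt = prompt_path.read_text(encoding="utf-8").strip()
    return motion, prompt


# Hypothetical file path; the duration follows from the 60 fps capture rate.
motion, prompt = load_sequence("express4d/recordings/clip_0001.csv")
print(prompt, motion.shape, f"~{len(motion) / FPS:.1f}s")
```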
To evaluate facial motion generation on Express4D, we trained two baseline models adapted from leading text-to-motion approaches for human bodies: a diffusion-based transformer (MDM) and a VQ-VAE model (T2M-GPT). Both models were trained to generate facial expressions from free-form text prompts and were evaluated using standard metrics from human motion generation. The results show that both models produce realistic, expressive outputs aligned with the input text, with each architecture excelling in different aspects, which establishes Express4D as a strong benchmark for future research in text-to-expression generation.
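Among the metrics adapted from the human motion generation literature, a Fréchet distance between feature distributions of real and generated sequences is the most common. The sketch below shows how such a score can be computed, assuming motion features have already been extracted by some pretrained encoder; the encoder itself, the feature dimensionality, and the variable names are illustrative assumptions rather than the exact evaluation code used here.

```python
# Sketch of an FID-style metric as used in text-to-motion benchmarks, computed
# between feature sets extracted from real and generated sequences by an
# assumed pretrained motion encoder (not part of Express4D itself).
import numpy as np
from scipy import linalg


def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """real_feats, gen_feats: arrays of shape (N, D) of encoder features."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)

    # sqrtm can pick up small imaginary parts from numerical error; drop them.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```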
“A person is pressing their lips together firmly and glancing sideways nervously.”
“A person is frowning slightly.”
“A person is furrowing their brows deeply and holding their mouth slightly open.”
“A person is exhaling quietly while their expression shifts from neutral to softly amused.”
“A person is rolling their eyes dramatically and scoffing.”
“A person is opening their mouth and closing it quickly.”
“A person is squinting briefly and then opening their eyes wide.”
“A person is widening their eyes and looking up suddenly as if surprised.”
“A person is softly nodding while maintaining a relaxed, even expression.”
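Prompts in this style could, for instance, be produced with a call to an LLM API. The sketch below is purely illustrative: the system instruction, model name, and output parsing are assumptions, not the authors' exact ChatGPT pipeline.

```python
# Illustrative sketch of generating Express4D-style expression prompts with an
# LLM. The instruction text and model choice are assumptions for the example.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = (
    "Write short third-person descriptions of dynamic facial expressions, "
    "one per line, in the form 'A person is ...'. Describe only facial motion, "
    "with no body movement or speech."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical choice; any chat-capable model works
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Generate 10 varied expression prompts."},
    ],
)

prompts = [
    line.strip()
    for line in response.choices[0].message.content.splitlines()
    if line.strip()
]
for p in prompts:
    print(p)
```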
A key advantage of Express4D is its scalability. Because the dataset was created with standard mobile devices and an LLM-based prompt-generation pipeline, it can easily be extended with new recordings. To support this, we provide a dedicated website where contributors can receive text prompts and upload their facial motion captures. Submissions will be sampled and reviewed to ensure quality before inclusion. This framework enables broad community participation while maintaining the dataset's expressiveness and consistency.
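As a rough illustration of what such a review step could check automatically, the sketch below validates a submitted CSV against the capture settings described earlier (60 fps, 61 blendshape coefficients per frame). The thresholds and column handling are assumptions, not the project's actual review criteria.

```python
# Hypothetical quality check for community submissions, based on the capture
# settings described above (60 fps, 61 blendshape coefficients per frame).
# Thresholds and column handling are assumptions, not actual review criteria.
import pandas as pd

FPS = 60
NUM_COEFFS = 61
MIN_SECONDS, MAX_SECONDS = 1.0, 30.0  # assumed bounds on clip length


def validate_submission(csv_path: str) -> list[str]:
    """Return a list of problems; an empty list means the clip passes."""
    problems = []
    df = pd.read_csv(csv_path)
    coeffs = df.select_dtypes(include="number")

    if coeffs.shape[1] < NUM_COEFFS:
        problems.append(
            f"expected {NUM_COEFFS} numeric coefficient columns, "
            f"found {coeffs.shape[1]}"
        )

    duration = len(df) / FPS
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"clip length {duration:.1f}s is outside the assumed bounds")

    if coeffs.isna().any().any():
        problems.append("sequence contains missing (NaN) coefficient values")

    return problems
```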
A link to the contribution website will be provided upon its launch.