MMFace4D: A Large-Scale Multi-Modal 4D Face Dataset for Audio-Driven 3D Face Animation

Tsinghua University, FACEGOOD Inc

The demonstration video of the paper. We first show data examples from the MMFace4D dataset, and then show examples of audio-driven 3D face animation.

Abstract

Audio-driven face animation is an eagerly anticipated technique for applications such as VR/AR, games, and movie making. With the rapid development of 3D engines, there is an increasing demand for driving 3D faces with audio. However, currently available 3D face animation datasets are either limited in scale or unsatisfactory in quality, which hampers further development of audio-driven 3D face animation. To address this challenge, we propose MMFace4D, a large-scale multi-modal 4D (3D sequence) face dataset consisting of 431 identities, 35,904 sequences, and 3.9 million frames. MMFace4D exhibits two compelling characteristics: 1) a remarkably diverse set of subjects and corpus, encompassing actors spanning ages 15 to 68 and recorded sentences with durations ranging from 0.7 to 11.4 seconds; 2) synchronized audio and 3D mesh sequences with high-resolution face details. To capture the subtle nuances of 3D facial expressions, we leverage three synchronized RGB-D cameras during the recording process. Building upon MMFace4D, we construct a non-autoregressive framework for audio-driven 3D face animation. Our framework considers the regional and composite natures of facial animations, and surpasses contemporary state-of-the-art approaches both qualitatively and quantitatively. The code, model, and dataset will be publicly available.

An overview of the MMFace4D dataset.

Capture Setup

We devise a capture system comprising three RGB-D cameras, one microphone, and one screen. Each camera is placed at a height of 1.2 meters. One camera shoots the front of the face, while the other two shoot the left and right sides at 45-degree angles. The cameras are accurately aligned. We use Azure Kinect cameras to capture RGB-D video, recording RGB at a resolution of 1920 × 1080 and depth at a resolution of 640 × 576.
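As a rough illustration of how each Azure Kinect can be configured to match these settings, the sketch below uses the pyk4a Python bindings (our own choice; the capture software actually used is not specified here), with 1080p color and the NFOV unbinned depth mode that yields 640 × 576 depth frames. The 30 FPS setting is likewise an assumption.

    # Minimal sketch of configuring one Azure Kinect to the recording settings above.
    # pyk4a and the 30 FPS setting are assumptions, not the tool chain used for MMFace4D.
    from pyk4a import PyK4A, Config, ColorResolution, DepthMode, FPS

    config = Config(
        color_resolution=ColorResolution.RES_1080P,  # 1920 x 1080 RGB
        depth_mode=DepthMode.NFOV_UNBINNED,          # 640 x 576 depth
        camera_fps=FPS.FPS_30,                       # assumed frame rate
        synchronized_images_only=True,               # keep RGB and depth frames paired
    )

    k4a = PyK4A(config)
    k4a.start()
    capture = k4a.get_capture()
    rgb = capture.color    # BGRA image, shape (1080, 1920, 4)
    depth = capture.depth  # uint16 depth map in millimeters, shape (576, 640)
    k4a.stop()

For a multi-camera rig, the Azure Kinect also supports hardware synchronization via sync cables; aligning the three viewpoints additionally requires extrinsic calibration, which is not shown here.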
Before recording, we built a large-scale corpus of 11,000 sentences covering different scenarios such as news broadcasting, conversation, and storytelling. Each sentence carries one of seven emotion labels (neutral, angry, disgust, happy, fear, sad, surprise): 2,000 sentences are neutral, and each of the other six emotions has 1,500 sentences. Sentences average 17 words, and the corpus covers each phoneme as evenly as possible.
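For concreteness, the corpus composition can be summarized as follows; the field names in the example entry are purely hypothetical and do not reflect the released file format.

    # Hypothetical corpus bookkeeping; field names and values are illustrative only.
    EMOTIONS = ["neutral", "angry", "disgust", "happy", "fear", "sad", "surprise"]

    sentences_per_emotion = {e: (2000 if e == "neutral" else 1500) for e in EMOTIONS}
    assert sum(sentences_per_emotion.values()) == 11_000   # 2000 + 6 * 1500

    example_entry = {
        "sentence_id": 42,            # hypothetical identifier
        "text": "...",                # the recorded sentence
        "emotion": "happy",           # one of the seven labels above
        "scenario": "storytelling",   # e.g. news broadcasting, conversation
    }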

Comparison of 4D (3D sequence) face datasets. The MMFace4D dataset has a competitive scale in terms of subject number (#Subj), corpus scale (#Corp), sequence number (#Seq), and duration (#Dura). Additionally, the frames per second (FPS), the emotion labels (Emo), the spoken language (Lang), and the presence of topology-uniformed meshes (Mesh) are also listed.

Tool Chain

The tool chain reconstructs topology-uniformed 3D face sequences from the RGB-D videos. We take the Basel Face Model (BFM) as a template 3D face and deform the template to fit the RGB-D videos. We construct two levels of deformation space for each 3D face: (1) the 3DMM parameters, and (2) the 3D face vertices. We design a multi-stage fitting pipeline with three stages: (1) initialization, (2) 3DMM parameter fitting, and (3) vertex-level fitting.

The 3D face reconstruction pipeline.
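To make the two deformation levels concrete, the toy sketch below first optimizes 3DMM coefficients against a synthetic depth point cloud and landmark targets, then refines per-vertex offsets on top of the fit. All bases, targets, loss weights, and iteration counts are synthetic placeholders of our own; this is not the released tool chain or the real BFM, and the initialization stage is omitted.

    # Self-contained toy sketch of the two-level fitting described above.
    # Everything here (bases, targets, weights) is synthetic and illustrative.
    import torch

    V, N_ID, N_EXP = 500, 80, 64                   # toy mesh size and basis dimensions
    torch.manual_seed(0)

    # Toy linear 3DMM: verts = mean + id_basis @ id_coef + exp_basis @ exp_coef
    mean_shape = torch.randn(V, 3)
    id_basis   = torch.randn(V, 3, N_ID) * 0.01
    exp_basis  = torch.randn(V, 3, N_EXP) * 0.01

    def morph(id_coef, exp_coef):
        return mean_shape + id_basis @ id_coef + exp_basis @ exp_coef

    # Synthetic "observations": a depth point cloud and a few landmark vertices.
    target_cloud = morph(torch.randn(N_ID), torch.randn(N_EXP)).detach()
    lmk_idx = torch.arange(0, V, 25)               # toy landmark indices
    target_lmk = target_cloud[lmk_idx]

    def depth_loss(verts):
        # distance from each template vertex to its nearest observed depth point
        return torch.cdist(verts, target_cloud).min(dim=1).values.mean()

    # ---- Stage 2: fit the 3DMM (BFM-like) coefficients ----
    id_coef  = torch.zeros(N_ID, requires_grad=True)
    exp_coef = torch.zeros(N_EXP, requires_grad=True)
    opt = torch.optim.Adam([id_coef, exp_coef], lr=0.05)
    for _ in range(300):
        verts = morph(id_coef, exp_coef)
        loss = (depth_loss(verts)
                + (verts[lmk_idx] - target_lmk).pow(2).sum(-1).mean()    # landmark term
                + 1e-4 * (id_coef.pow(2).sum() + exp_coef.pow(2).sum())) # coefficient prior
        opt.zero_grad(); loss.backward(); opt.step()

    # ---- Stage 3: per-vertex refinement on top of the 3DMM fit ----
    base = morph(id_coef, exp_coef).detach()
    offsets = torch.zeros_like(base, requires_grad=True)
    opt2 = torch.optim.Adam([offsets], lr=0.01)
    for _ in range(100):
        verts = base + offsets
        loss = depth_loss(verts) + 1e-2 * offsets.pow(2).sum(-1).mean()  # keep offsets small
        opt2.zero_grad(); loss.backward(); opt2.step()

    print("final depth loss:", depth_loss(base + offsets).item())

In the real pipeline, the depth and landmark targets come from the aligned multi-view RGB-D recordings, and the vertex-level stage would typically use stronger smoothness priors (e.g., a Laplacian term) than the simple offset penalty shown here.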

Dataset Download

This dataset is available for academic use only. For privacy protection, only the landmarks, speech audio, depth data, camera intrinsics, and reconstructed meshes are publicly available. Users must sign the EULA form and send the scanned form to wuhz19 [at] mails.tsinghua.edu.cn. Once approved, you will be supplied with a download link.
The EULA form.
To preprocess our dataset, please see the README.md in our code.

BibTeX


        @article{wu2023mmface4d,
          title={MMFace4D: A Large-Scale Multi-Modal 4D Face Dataset for Audio-Driven 3D Face Animation},
          author={Wu, Haozhe and Jia, Jia and Xing, Junliang and Xu, Hongwei and Wang, Xiangyuan and Wang, Jelo},
          journal={arXiv preprint arXiv:2303.09797},
          year={2023}
        }