What is GUESS?
GUESS, short for Gradually Enriching Synthesis, is a diffusion-based generative framework that produces human motion from a text description. In simplified terms, it translates text into motion. It starts with a coarse model and gradually adds detail to it, which is where the name Gradually Enriching Synthesis comes from. It is called diffusion-based because it starts from random noise and removes that noise step by step until a clean result remains. GUESS also includes a dynamic fusion mechanism that balances the influence of the given text prompt against the coarser motion that has already been generated.
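To make the "diffusion" part more concrete, here is a minimal sketch of the standard noising formula used by diffusion models in general. It is not GUESS's own code: the function names, the linear noise schedule, and the three-number "signal" are all assumptions chosen for illustration. In a real system a trained network predicts the noise; here the true noise is reused to show that a good prediction recovers the clean signal.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear noise schedule (assumed)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Forward process: blend the clean signal with Gaussian noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
          for x, n in zip(x0, noise)]
    return xt, noise

def estimate_clean(xt, predicted_noise, alpha_bar):
    """Invert the forward formula, given a prediction of the added noise."""
    return [(x - math.sqrt(1.0 - alpha_bar) * n) / math.sqrt(alpha_bar)
            for x, n in zip(xt, predicted_noise)]

if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = make_alpha_bars(1000)
    clean = [0.5, -1.0, 2.0]            # stand-in for a pose vector
    noisy, noise = add_noise(clean, alpha_bars[-1], rng)
    # A trained denoiser would predict `noise`; the true noise is reused here.
    print(estimate_clean(noisy, noise, alpha_bars[-1]))
```

The better the noise prediction, the closer the estimate is to the clean signal, which is why generation proceeds in many small denoising steps rather than one large one.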
How does GUESS work?
The task is a hard one: words often cannot describe precisely what is wanted, such as the magnitude of an action. A careful solution is therefore required. GUESS works with four types of skeleton, S1, S2, S3, and S4, each carrying a different amount of detail. Each type of skeleton is produced over many iterations, with each pass giving feedback that increases its precision. When the text is given, a predetermined number of initial pose estimates are made at the S1 level. The position data from S1 is then passed on to S2, S3, and S4. S2 and S3 are built next; unlike S1, they are organized by body part. Further information derived from the joint positions of S2 and S3, such as 3D rotation, kinematic chains, velocity, and foot contact, is carried over to S4. S4 is a very rough sketch that shows the trajectory of the motion sequence. Each scale is sent to its own denoiser, and the results are compiled into a 3D object. Finally, diffusion is used to turn the 3D object into a human-like form. This cycle is repeated to create motion.
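The idea of four skeletons with decreasing detail can be sketched as repeated pooling of joint positions. This is only an illustration under stated assumptions: the joint names, the groupings, the use of a plain average, and the reading of S1 as the most detailed scale and S4 as the coarsest are all hypothetical and not taken from GUESS itself.

```python
# Hypothetical joint names and 3D positions, chosen only for illustration.
S1_POSE = {
    "head": (0.0, 1.7, 0.0),
    "chest": (0.0, 1.4, 0.0),
    "left_hand": (-0.6, 1.0, 0.0),
    "right_hand": (0.6, 1.0, 0.0),
    "left_foot": (-0.2, 0.0, 0.0),
    "right_foot": (0.2, 0.0, 0.0),
}

# Hypothetical groupings from each scale to the next, coarser one.
S1_TO_S2 = {
    "torso": ["head", "chest"],
    "arms": ["left_hand", "right_hand"],
    "legs": ["left_foot", "right_foot"],
}
S2_TO_S3 = {"upper": ["torso", "arms"], "lower": ["legs"]}
S3_TO_S4 = {"body": ["upper", "lower"]}

def pool_joints(pose, groups):
    """Replace each group of nodes by the average of their 3D positions."""
    coarse = {}
    for name, members in groups.items():
        points = [pose[m] for m in members]
        coarse[name] = tuple(sum(axis) / len(points) for axis in zip(*points))
    return coarse

def build_scales(pose, mappings):
    """Return [S1, S2, S3, S4]: the input pose followed by each coarser scale."""
    scales = [pose]
    for groups in mappings:
        scales.append(pool_joints(scales[-1], groups))
    return scales

if __name__ == "__main__":
    scales = build_scales(S1_POSE, [S1_TO_S2, S2_TO_S3, S3_TO_S4])
    for level, scale in enumerate(scales, start=1):
        print(f"S{level}: {len(scale)} node(s)")
```

Tracking the single S4 node over time gives a rough trajectory of the whole body, while the finer scales add back the detail of individual parts and joints.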
GUESS was tested alongside several related methods to measure its performance. The experiments used multiple datasets: HumanML3D, a commonly used large dataset for human motion; KIT-ML, a dataset containing raw motion samples; HumanAct12, a dataset specialized in daily indoor activities; and UESTC, the biggest dataset, with over 72,000 human motion samples. All of the selected methods were run on these datasets and evaluated with OpenAI's CLIP on how correct the motion is, how many words were understood, diversity across different prompts, and accuracy. GUESS outperforms state-of-the-art methods by a large margin in most of the categories. For example, it surpassed T2M-GPT in every metric measured.
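Of these criteria, diversity is the easiest to illustrate. A common way to measure it is the average distance between feature vectors of generated motions. The sketch below is a simplified version under stated assumptions: it averages over all pairs rather than a random sample of pairs, and it assumes each motion has already been converted into a feature vector by some encoder, which is not shown. The function name is hypothetical.

```python
import math
from itertools import combinations

def mean_pairwise_distance(features):
    """Average Euclidean distance over all pairs of feature vectors."""
    pairs = list(combinations(features, 2))
    if not pairs:
        raise ValueError("need at least two feature vectors")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

if __name__ == "__main__":
    # Made-up 2D features standing in for encoded motions.
    motions = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]
    print(mean_pairwise_distance(motions))
```

A higher value means the generated motions differ more from one another; a value of zero means every motion is identical.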
The future of GUESS
GUESS is a major advance in human animation and could directly benefit VR, game design, and cinematography. Even so, it has limitations. Currently, GUESS is a static network with a fixed number of four scales, but the researchers plan to make it sample-dependent, so that the number of scales changes with the action. Making it a dynamic system is expected to yield faster and higher-quality results. Second, the researchers hope to further develop the coarse-to-fine method by having GUESS generate a low-resolution estimate first and add detail as the process continues.