A new AI tool that generates lifelike talking faces has made the Mona Lisa a rap artist, and people are baffled.
AI has become a fundamental component of nearly every aspect of daily life, and AI-powered smart devices are now an inevitable part of most people's routines. Despite its benefits, the rapid expansion of AI has sparked concerns about its potential for manipulation. Nevertheless, experts continue to refine AI technologies and develop innovative tools. For instance, Microsoft's VASA-1, a new AI tool that generates lifelike, audio-driven talking faces in real time, has managed to make the Mona Lisa a rapper. AI educator Min Choi, who goes by @minchoi on X, shared the video, which has been going viral on the platform.
Choi wrote in the caption, "Microsoft just dropped VASA-1. This AI can make a single image sing and talk from audio reference expressively." The uncanny video shows the 16th-century painting "Mona Lisa" coming to life and performing "Paparazzi," a rap written and performed by Anne Hathaway on Conan O'Brien's talk show in 2011. The AI tool reimagines how Mona Lisa's facial features would move as she raps the song. What was once a portrait of a woman with an enigmatic expression suddenly becomes a lively, comical rapper with wide eyes and synchronized lip movements.
In barely a week, Choi's post gained wide popularity, with over 7 million views and counting. People were baffled by the tech giant's creation, which stems from research by Microsoft Research Asia published on April 16. Given a static image and an audio clip, VASA-1 uses its visual affective skills (VAS) to produce lip movements, facial expressions and nuanced head movements that are in sync with the audio. Microsoft developed a diffusion-based model for holistic facial dynamics and head movement generation for this tool, which makes the end results look strikingly close to reality. A great deal of research and meticulous testing went into making Mona Lisa look like a natural rap artist in the video.
"This is wild, freaky and creepy all at once," exclaimed @artofnikita. "These tools perhaps hold an interesting key to the future of Virtual & Augmented Reality. Now that we can convincingly cross the uncanny valley and render a 2D image of a person into a 3D representation, we're a step closer to being able to turn a complex single-camera video (or still!) into a fully immersive environment with six degrees of freedom," pointed out @7LukeHallard. "This is one of the craziest things I've come across. AI's here to stay," said @TheoCartz.
"5. Controllability of generation. Example of eye gaze direction and head distance, and emotion offsets pic.twitter.com/Qk2jczNIpU" — Min Choi (@minchoi) April 18, 2024
Beyond the rapping Mona Lisa, Microsoft has also unveiled a series of sample videos demonstrating VASA-1's capabilities, showcasing remarkable levels of realism and liveliness. Thanks to out-of-distribution generalization, the tool can adeptly process image and audio inputs well beyond its training data, from artistic photos to vocal performances and multilingual speech. VASA-1 also disentangles appearance, 3D head pose and facial dynamics, which allows individual attributes of the generated content to be controlled and edited, producing almost lifelike videos.
"Microsoft just dropped VASA-1. This AI can make single image sing and talk from audio reference expressively. Similar to EMO from Alibaba. 10 wild examples: 1. Mona Lisa rapping Paparazzi pic.twitter.com/LSGF3mMVnD" — Min Choi (@minchoi) April 18, 2024