Artificial intelligence is getting better and better at generating images from a handful of words, as publicly available AI image generators like DALL-E 2 and Stable Diffusion show. Now Meta researchers are taking the technology a step further: they're using it to generate videos from a text prompt.
Mark Zuckerberg, CEO of Meta, posted on Facebook Thursday about the research project, called Make-A-Video, sharing a 20-second clip that compiles various text prompts used by Meta researchers and the resulting (very short) videos. Prompts include "A teddy bear painting a self-portrait," "A spaceship landing on Mars," "A baby sloth in a knitted hat trying to figure out a laptop," and "A robot surfing an ocean wave."
The videos for each prompt are only a few seconds long, and they usually show what the prompt suggests (with the exception of the baby sloth, which doesn't look much like the real animal) in fairly low resolution and a somewhat choppy style. Still, they demonstrate a new direction AI research is taking as systems get better and better at generating images from words. If the technology is eventually released widely, however, it will raise many of the same concerns as text-to-image systems, such as the potential to spread misinformation via video.
A web page for Make-A-Video includes these short clips and others, some of which look quite realistic, such as a video created in response to the prompt "Clown fish swimming in the coral reef" or one meant to show "A young couple walking in heavy rain."
In his Facebook post, Zuckerberg pointed out how difficult it is to generate a moving image from a handful of words.
“It is much more difficult to generate videos than photos because in addition to correctly rendering each pixel, the system also has to predict how they will change over time,” he wrote.
A research paper describing the work explains that the project uses a text-to-image AI model to figure out how words correspond to images, along with an AI technique known as unsupervised learning, in which algorithms painstakingly analyze unlabeled data to discern patterns within it, to watch video footage and learn what realistic motion looks like.
As with the massively popular AI systems that generate images from text, the text-to-image model behind Make-A-Video was trained on data from the internet, meaning it "learned and likely exaggerated societal biases, including harmful ones," the researchers wrote. They noted that they filtered the data for "NSFW content and toxic words," but since data sets can include many millions of images and pieces of text, it may not be possible to remove all such content.
Zuckerberg wrote that Meta plans to share the Make-A-Video project as a demo in the future.