Users will soon witness VideoPoet and its successors produce astonishingly realistic videos.
Animating the Mona Lisa painting from two prompts: “A woman looking at the camera” and “A woman yawning”.
As the wave of generative AI text models settles into increasingly refined products, a new wave of AI video generation models is beginning to flourish. These models, however, still face significant challenges in producing motion that looks plausible to viewers.
Over time, these models will keep learning, producing higher-quality and more realistic output. Their advantage lies in a relatively simple production process: a well-crafted prompt is all it takes to instruct the AI to generate a video. VideoPoet is also a versatile model, able to create videos from text instructions, generate videos from images, stylize existing videos, and more.
OpenAI’s Sora has recently attracted public attention with a series of astonishingly realistic AI-generated videos, but OpenAI is not alone in this line of research. Google has a similar project of its own, VideoPoet, which has been in development for some time and has produced impressive results.
Video from the prompt: “Two pandas playing cards”.
Video from the prompt: “A horse galloping against the backdrop of Van Gogh’s Starry Night”.
According to Google’s researchers, VideoPoet can animate input images to create movement; it can also fill in missing content (inpainting, such as restoring obscured parts of a video) or generate additional content that extends beyond the original frame (outpainting).
For the stylization task, the model takes in estimates of a video’s depth and optical flow, which capture its motion, and then paints content on top in a style guided by the user’s text prompt. Below are results from stylizing videos that were themselves generated by Google’s AI model.
Video prompts (from left to right): “A wombat wearing sunglasses holding a volleyball on the beach”; “A teddy bear ice skating on a frozen lake”; “A metal lion roaring in the light of a forge”.
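To make that flow concrete, here is a minimal toy sketch of a depth-and-flow-conditioned stylization pipeline. VideoPoet’s actual interfaces are not public, so every function below is an illustrative stand-in of our own, not the real API.

```python
import numpy as np

# Toy sketch of the stylization pipeline described above. VideoPoet is not
# publicly callable, so estimate_depth, estimate_optical_flow, and stylize
# are all illustrative stand-ins. A video here is (frames, height, width, 3).

def estimate_depth(video: np.ndarray) -> np.ndarray:
    """Placeholder per-frame depth estimator (real systems use a learned model)."""
    return video.mean(axis=-1)  # (frames, H, W) fake "depth"

def estimate_optical_flow(video: np.ndarray) -> np.ndarray:
    """Placeholder motion estimator: crude frame-to-frame differences."""
    return np.diff(video, axis=0)  # (frames - 1, H, W, 3) fake "flow"

def stylize(depth: np.ndarray, flow: np.ndarray, prompt: str) -> np.ndarray:
    """Stand-in for the generative step that paints prompt-guided content
    on top of the depth/flow motion scaffold. Returns blank frames here."""
    frames, h, w = depth.shape
    return np.zeros((frames, h, w, 3))

clip = np.random.rand(16, 64, 64, 3)  # a 16-frame toy clip
styled = stylize(estimate_depth(clip), estimate_optical_flow(clip),
                 prompt="A teddy bear ice skating on a frozen lake")
```

The key design point is that the source video contributes only structure and motion (depth and flow), while all visible content is repainted to match the prompt.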
Conditioned on the last second of a video, the model can predict the content likely to occur in the following second. By repeating this process, VideoPoet can not only extend videos with ease but also keep the objects appearing in the clip consistent over time.
Video from the prompt: “An astronaut begins to jump on Mars. Then, brilliant fireworks explode from behind.”
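The extension loop described above is easy to sketch. The snippet below is a minimal toy version, assuming a hypothetical predict_next_second model call; the stand-in here merely repeats the final frame, whereas the real model predicts genuinely new content.

```python
import numpy as np

FPS = 8  # assumed frame rate for this toy example

def predict_next_second(context: np.ndarray) -> np.ndarray:
    """Stand-in for the model: given the last second of video, return one
    predicted second. Here we simply repeat the final frame."""
    return np.repeat(context[-1:], FPS, axis=0)

def extend_video(video: np.ndarray, extra_seconds: int) -> np.ndarray:
    """Autoregressively append predicted seconds to the end of a clip."""
    for _ in range(extra_seconds):
        context = video[-FPS:]  # condition only on the last second
        video = np.concatenate([video, predict_next_second(context)])
    return video

clip = np.random.rand(2 * FPS, 64, 64, 3)     # a 2-second toy clip
longer = extend_video(clip, extra_seconds=3)  # now 5 seconds long
```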
VideoPoet can also generate sound. For 2-second clips, the model attempts to predict matching audio without requiring any text prompt, making it possible to generate both video and audio from a single model.
Audio generated for a clip of a teddy bear drumming.
Audio generated for a clip of a cat playing the piano.
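As a rough illustration of that video-to-audio step, the sketch below splits a clip into 2-second chunks and asks a model for a waveform per chunk. predict_audio_for_clip is a placeholder of our own, not VideoPoet’s real interface.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed audio sample rate
CLIP_SECONDS = 2      # the 2-second window mentioned above

def predict_audio_for_clip(video_chunk: np.ndarray) -> np.ndarray:
    """Stand-in for the model: maps a video chunk to a waveform with no
    text prompt. Returns silence here; the real model predicts sound
    that matches the visual content."""
    return np.zeros(CLIP_SECONDS * SAMPLE_RATE)

def add_soundtrack(video: np.ndarray, fps: int = 8) -> np.ndarray:
    """Generate audio for a whole clip, 2 seconds at a time."""
    chunk = CLIP_SECONDS * fps
    waveforms = [predict_audio_for_clip(video[i:i + chunk])
                 for i in range(0, len(video), chunk)]
    return np.concatenate(waveforms)

clip = np.random.rand(4 * 8, 64, 64, 3)  # a 4-second toy clip at 8 fps
audio = add_soundtrack(clip)             # one waveform for the whole clip
```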
With VideoPoet, Google demonstrates that large language models can be highly competitive well beyond text, producing not only written content but also eye-catching, realistic videos.
The results point to the promising potential of large language models in video generation. In the future, such models could produce content from a wide range of input prompts: generating audio from text, creating video from speech, automatically captioning videos, and many other applications.