Sora does not create videos by stitching together multiple images; instead, it renders pixels in real time based on an understanding of physical motion.
OpenAI's Sora has been praised by experts as the leading generative AI tool, producing the highest-quality video content available today. “Sora marks a leap forward in the field of text-to-video conversion,” ABC News commented.
Meanwhile, Time reported that before Sora emerged, there were already AI models like Runway and Pika capable of creating videos, but their weaknesses were poor video quality and short duration. In contrast, Sora can generate 60-second videos with complex contexts that remain smooth and logically coherent, even though some errors persist.
Video generated by Sora from text: paper airplanes flying through the jungle, weaving around the treetops as if they were migrating birds. (Source: OpenAI).
Breakthrough by OpenAI
OpenAI has not released its text-to-video model to the public. In its description, the company also says little about the underlying technology or the data used to train it.
“Sora uses a diffusion model, creating videos by starting with a noisy, low-resolution video and then removing the noise through multiple steps until the output is satisfactory,” the company behind ChatGPT said of how Sora works. This allows the AI to generate an entire video at once rather than in short segments that are then stitched together, as other tools do. The algorithm lets the model predict many frames simultaneously, keeping the subject intact while other details are recreated.
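The denoising process described above can be illustrated with a toy sketch. This is not Sora's actual algorithm (the model, network, and noise schedule are not public); `denoise_step` is a hypothetical stand-in that simply blends a noisy sample toward a known clean target, standing in for the neural network that would predict the noise at each step.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(video, step, total_steps, target):
    """One refinement step: remove a fraction of the remaining noise.

    In a real diffusion model, a neural network would predict the noise
    to subtract; here we cheat and blend toward a known clean `target`.
    """
    alpha = 1.0 / (total_steps - step)  # remove a larger share of the
    return (1 - alpha) * video + alpha * target  # remaining noise each step

# A "video" as a (frames, height, width) array, denoised as one whole
# volume rather than frame by frame, which keeps subjects consistent.
clean = np.ones((8, 4, 4))            # stand-in for the finished video
video = rng.normal(size=clean.shape)  # start from pure noise

total_steps = 50
for step in range(total_steps):
    video = denoise_step(video, step, total_steps, clean)

print(np.abs(video - clean).max())  # prints 0.0: the last step removes all noise
```

The key point the sketch preserves is that every frame of the volume is refined together at each step, which is why subjects do not drift between frames the way they can when frames are generated one at a time.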
Simulation of how Sora generates video by removing noise through an algorithm. (Source: Medium).
According to OpenAI, Sora is built upon previous research on image-generating AI like Dall-E and text-generating models like ChatGPT. However, Dr. Jim Fan, a senior AI researcher at Nvidia, noted: “If you still think Sora is just a creative toy like Dall-E, think again. It is a physics-based model that relies on data, capable of simulating both the real and virtual worlds.”
He emphasized that Sora is an end-to-end diffusion transformer model. Its secret lies in a deep understanding of the text prompt, which it converts into a 3D visual representation. From there, the model makes predictions based on the rules of physical motion, rendering each pixel of the video as accurately as possible.
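One way a diffusion transformer can predict many frames at once is to cut the video volume into small "spacetime" patches (time × height × width) that the transformer processes jointly, like tokens in a sentence. The sketch below shows only that patching step; Sora's real patch sizes and architecture are not public, so `to_spacetime_patches` and its parameters are illustrative assumptions.

```python
import numpy as np

def to_spacetime_patches(video, t=2, h=2, w=2):
    """Split a (frames, height, width) video into flattened 3D patches.

    Each patch spans `t` frames and an `h` x `w` spatial region, so a
    single token carries information across time as well as space.
    """
    F, H, W = video.shape
    patches = []
    for f in range(0, F, t):
        for y in range(0, H, h):
            for x in range(0, W, w):
                patches.append(video[f:f+t, y:y+h, x:x+w].ravel())
    return np.stack(patches)  # (num_patches, t*h*w) token sequence

# A tiny 4-frame, 4x4-pixel "video" becomes 8 tokens of 8 values each.
video = np.arange(4 * 4 * 4, dtype=float).reshape(4, 4, 4)
tokens = to_spacetime_patches(video)
print(tokens.shape)  # prints (8, 8)
```

Because each token already mixes several frames, a transformer attending over these tokens reasons about motion across time directly, rather than predicting one frame and then the next.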
“Sora’s simulator does not just rely on learned data; it can also self-train, finding the most accurate results to continue creating,” Fan analyzed. He believes what sets Sora apart is that it does not create videos by piecing together discrete image sequences but rather renders a set of pixels in real-time.
Sora generates five videos simultaneously based on a request to describe a scene from five different perspectives. Sora author Bill Peebles noted that he did not intervene; the AI independently assembled a complete film.
This has led experts to recall the AI math model announced by three researchers of Vietnamese origin in the scientific journal Nature last month. In its technical description of how Sora operates, OpenAI also stated that this video-generating model would play a foundational role in enabling AI to understand and simulate the real world.
“We believe this will be a significant milestone in achieving AGI,” OpenAI declared.
Weaknesses of Sora
According to Medium, converting text into video is a challenging task because it requires the AI to understand the meaning and context of the text as well as various aspects of images, videos, and physical motion. One reason OpenAI has limited Sora to a small group for trial use is that it still has some shortcomings.
“Sora may struggle to accurately simulate the physical properties of a complex scene. It may not correctly understand cause-and-effect statements,” OpenAI admitted.
The company cited an example in which Sora creates a video of a person biting into a cookie, yet afterward the cookie remains whole, with no bite mark. It may also confuse details such as left and right, or front and back, for example depicting a man running backward on a treadmill.
Sora depicts a man running backward on a treadmill. (Source: OpenAI).
However, analysts argue that the greatest concern regarding Sora lies in the very breakthrough of OpenAI. The videos generated are so realistic that many fear the model could be misused to spread misinformation, violate privacy, promote racism, and even influence election outcomes. While the company prohibits the use of Sora for creating harmful content, it has yet to find a way to distinguish between AI-generated and real images for labeling and classification.
Fred Havemeyer, head of AI research at Macquarie, believes that Sora’s incredible capabilities will raise significant ethical concerns and societal impacts. He stated that the negative effects of AI will be the most debated topic in 2024, and Sora is just the beginning.
According to The New York Times, OpenAI still closely guards information about the content used to train Sora, including how much of it is copyrighted. “They may want to keep it secret to maintain a competitive edge, but they might also fear lawsuits related to copyright issues, similar to the troubles ChatGPT is facing,” the publication noted.
Nonetheless, analysts agree that Sora is ushering in a new era of AI-generated video, similar to the way ChatGPT emerged. Once officially commercialized, it could directly impact the film, media, and game design industries.
Reece Hayden, a senior analyst at ABI Research, stated on CBS News that in the future, AIs like Sora could even change how platforms like Netflix operate, allowing users to modify story endings or create their own movies just by providing text prompts.