OpenAI Text-to-video Model Sora Wows X but Still Has Weaknesses

Artificial intelligence firm OpenAI unveiled its first-ever text-to-video model to a strong reception on Thursday, though the firm admits the model still has a ways to go.

OpenAI unveiled the new generative AI model, dubbed Sora, on Feb. 15. It is said to create detailed videos from simple text prompts, continue existing videos, and even generate scenes based on a still image.

Introducing Sora, our text-to-video model. Sora can create videos of up to 60 seconds featuring highly detailed scenes, complex camera motion, and multiple characters with vibrant emotions. https://t.co/7j2JN27M3W Prompt: "Beautiful, snowy... pic.twitter.com/ruTEWn87vf

- OpenAI (@OpenAI) February 15, 2024

According to a Feb. 15 blog post, OpenAI claimed the AI model can generate movie-like scenes at resolutions up to 1080p. These scenes can include multiple characters, specific types of motion, and accurate details of the subject and background.

How Sora works

Much like OpenAI's image-based predecessor DALL-E 3, Sora operates on what's known as a "diffusion" model.

Diffusion refers to a generative AI model creating its output by starting with a video or an image that looks like "static noise" and then gradually transforming it by "removing the noise" over several steps.
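To make the idea concrete, here is a minimal Python sketch of that noise-removal loop. It is a toy under stated assumptions, not OpenAI's implementation: the function name toy_denoise, the step count, and the four-number "frame" are all hypothetical. Where a real diffusion model would use a trained neural network to predict the noise from the text prompt, this sketch cheats by comparing against a known target.

```python
import random


def toy_denoise(target, steps=10, seed=0):
    """Start from pure random noise and step toward `target`.

    Returns the list of frames, from the initial noise to the final result.
    Illustrative only: a real diffusion model does not know the target.
    """
    rng = random.Random(seed)
    # Step 0: a "frame" of pure static noise, the same size as the target.
    frame = [rng.gauss(0.0, 1.0) for _ in target]
    history = [frame]
    for step in range(steps):
        # ASSUMPTION: a real model would *predict* this noise with a trained
        # network conditioned on the prompt. Here it is computed exactly.
        predicted_noise = [x - t for x, t in zip(frame, target)]
        # Remove only a fraction of the noise per step; the fraction grows
        # so that the final step removes whatever noise is left.
        fraction = 1.0 / (steps - step)
        frame = [x - fraction * n for x, n in zip(frame, predicted_noise)]
        history.append(frame)
    return history


if __name__ == "__main__":
    target = [0.0, 0.5, 1.0, 0.5]  # stand-in for the pixels of a clean frame
    for i, frame in enumerate(toy_denoise(target)):
        error = max(abs(x - t) for x, t in zip(frame, target))
        print(f"step {i:2d}: max distance from target = {error:.4f}")
```

Each pass leaves the frame a little closer to the target, which is the gradual "removing the noise" behavior described above.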

Announcing Sora - our model which creates minute-long videos from a text prompt:

https://t.co/SZ3OxPnxwz

- Greg Brockman (@gdb) February 15, 2024

The AI firm wrote that Sora builds on past research behind its GPT and DALL-E 3 models, something the firm claims makes the model better at "faithfully" representing user inputs.

OpenAI admitted that Sora still has several weaknesses and can struggle to accurately simulate the physics of a complex scene, namely by muddling up cause and effect.

"For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark."

The new tool can also confuse the "spatial details" of a given prompt by mixing up