OpenAI’s Sora text-to-video tool’s impact will be ‘profound’

OpenAI is not the first to offer generative AI technology that can transform a text prompt into realistic video, but its tool appears to be among the most advanced to date.

OpenAI last week unveiled a new capability for its generative AI (genAI) platform that can use a text input to generate video — complete with life-like actors and other moving parts.

The new genAI model, called Sora, has a text-to-video function that can create complex, realistic moving scenes with multiple characters, specific types of motion, and accurate details of the subject and background “while maintaining visual quality and adherence to the user’s prompt.”

Sora understands not only what a user asks for in the prompt, but also how those things exist in the physical world.

The technology basically translates written descriptions into video content, leveraging AI models that understand textual input and generate corresponding visual and auditory elements, according to Bernard Marr, a technology futurist and business and technology consultant.

“This process involves deep learning algorithms capable of interpreting text and synthesizing videos that reflect the described scenes, actions, and dialogues,” Marr said.

Text-to-video is not a new capability, since AI engines from other providers, such as Google’s Gemini, already offer it, but Sora’s impact is expected to be profound, according to Marr.

Using Google’s Lumiere, off-the-shelf text-based image editing methods can be used for video editing.

Like any advanced genAI technology, Marr said, Sora will help reshape content creation, enhancing storytelling and democratizing video production.

“Text-to-video capabilities hold immense potential across diverse fields such as education, where they can create immersive learning materials; marketing, for generating engaging content; and entertainment, for rapid prototyping and storytelling,” Marr said.

However, Marr warned, the ability of AI models to translate textual descriptions into full-fledged videos also underscores the need for rigorous ethical considerations and safeguards against misuse.

“The emergence of text-to-video technology introduces complex issues regarding copyright infringement, particularly as it becomes capable of generating content that might closely mirror copyrighted works,” Marr said. “The legal landscape in this area is currently being navigated through several ongoing lawsuits, making it premature to definitively state how copyright concerns will be resolved.”

Potentially more concerning is the ability of the technology to produce highly convincing deepfakes, which raises serious ethical and privacy issues and underscores the need for close scrutiny and regulation, Marr said.

Dan Faggella, a founder and lead researcher of Emerj Artificial Intelligence, gave a presentation about deepfakes at the United Nations five years ago. At the time, he emphasized that regardless of warnings about deepfakes, “people will want to believe what they want to believe.”

There is, however, a bigger consideration: soon, people will be able to live in genAI worlds where they strap on a headset and tell an AI model to create a unique world to satisfy emotional needs, be it relaxation, humor, action – all programmatically built specifically for that user.

“And what the machine is going to be able to do is conjure visual and audio and eventually haptic experiences for me that are trained on the [previous experiences] wearing the headset,” Faggella said. “We need to think about this from a policy standpoint; how much of that escapism do we permit?”

Text-to-video models can also build applications that conjure AI experiences to help people be productive, educate them, and keep them focused on their most important work. “Maybe train them to be a great salesperson, maybe help them write great code, and do a lot more coding than they can do right now,” he said.

Both OpenAI’s Sora and Google’s Gemini 1.5 multimodal AI model are, for now, internal research projects offered only to a select group of third-party academics and others testing the technology.

Unlike with OpenAI’s popular ChatGPT, Google said, users can feed a much larger amount of information into Gemini 1.5’s query engine to get more accurate responses.

Even though Sora and Gemini 1.5 are currently internal research projects, both companies have showcased real examples and detailed information, including videos, photos, GIFs, and related research papers.

Along with Google’s Gemini multimodal AI engine, Sora was predated by several text-to-video models, including Meta’s Emu, Runway’s Gen-2, and Stability AI’s Stable Video Diffusion.

The denoising process used by Stable Diffusion: the model generates images by iteratively removing random noise until a configured number of steps has been reached. It is guided by a CLIP text encoder pretrained on concepts, along with the attention mechanism, and produces an image depicting a representation of the trained concept.
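The loop described in that caption can be illustrated with a deliberately simplified Python sketch. It is a toy, not Stable Diffusion: `toy_text_encoder` is a hypothetical stand-in for a CLIP text encoder, the "image" is a short list of numbers, and the noise prediction is plain subtraction where a real model would use a trained neural network. What it does show is the general shape of the process: start from random noise, then strip away a little more of it on each of a configured number of steps, guided by the prompt.

```python
import random


def toy_text_encoder(prompt: str, size: int = 8) -> list[float]:
    """Hypothetical stand-in for a text encoder such as CLIP.

    Maps a prompt to a deterministic vector. A real encoder returns
    learned embeddings; this one only seeds a random generator.
    """
    rng = random.Random(prompt)  # same prompt -> same vector
    return [rng.uniform(-1.0, 1.0) for _ in range(size)]


def toy_denoise(prompt: str, num_steps: int = 50, seed: int = 0):
    """Refine random noise toward a prompt-dependent target in num_steps steps.

    Returns the final sample and the largest remaining error after each step.
    """
    target = toy_text_encoder(prompt)
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise

    errors = []
    for step in range(num_steps):
        # Assumption: a real model *predicts* the noise with a neural network.
        # Here the "noise" is simply the gap between sample and target.
        predicted_noise = [s - t for s, t in zip(sample, target)]
        # Remove a growing share of the remaining noise; the last step removes all.
        fraction = 1.0 / (num_steps - step)
        sample = [s - fraction * n for s, n in zip(sample, predicted_noise)]
        errors.append(max(abs(s - t) for s, t in zip(sample, target)))
    return sample, errors


if __name__ == "__main__":
    _, errors = toy_denoise("a dog running on a beach", num_steps=10)
    for step, err in enumerate(errors, start=1):
        print(f"step {step:2d}: remaining noise {err:.4f}")
```

Running it prints a remaining-noise figure that shrinks at every step, which is the behavior the caption describes; everything else about a production model (the network, the scheduler, the attention layers) is omitted here.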

Google has two concurrent research projects advancing what a spokesperson called “state-of-the-art in video generation models.” Those projects are Lumiere and VideoPoet.

Released earlier this month, Lumiere is Google’s more advanced video generation technology; it generates 80 frames per clip, compared to 25 frames from competitors such as Stable Video Diffusion.

“Gemini, designed to process information and automate tasks, offers a seamless integration of modalities from the outset, potentially making it more intuitive for users who seek a straightforward, task-oriented experience,” Marr said. “On the other hand, GPT-4’s layering approach allows for a more granular enhancement of capabilities over time, providing flexibility and depth in conversational abilities and content generation.”

In a head-to-head comparison, Sora appears more powerful than Google’s video generation models. While Google’s Lumiere can produce video at 512×512-pixel resolution, Sora is said to reach resolutions of up to 1920×1080 pixels, or HD quality.

Lumiere’s videos are limited to about 5 seconds in length; Sora’s videos can run up to one minute.
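To put those reported figures side by side, the short Python sketch below works out the gap in pixels per frame and in maximum clip length. The numbers are the ones cited in this article, not independent measurements, and the dictionary and function names are made up for illustration.

```python
# Figures as reported in the article; treat them as claims, not benchmarks.
LUMIERE = {"width": 512, "height": 512, "max_seconds": 5}
SORA = {"width": 1920, "height": 1080, "max_seconds": 60}


def pixels_per_frame(spec: dict) -> int:
    """Number of pixels in a single frame at the given resolution."""
    return spec["width"] * spec["height"]


pixel_ratio = pixels_per_frame(SORA) / pixels_per_frame(LUMIERE)
length_ratio = SORA["max_seconds"] / LUMIERE["max_seconds"]

print(f"Pixels per frame: {pixels_per_frame(LUMIERE):,} vs {pixels_per_frame(SORA):,}")
print(f"Sora frames hold about {pixel_ratio:.1f}x as many pixels")
print(f"Sora clips can run about {length_ratio:.0f}x as long")
# Pixels per frame: 262,144 vs 2,073,600
# Sora frames hold about 7.9x as many pixels
# Sora clips can run about 12x as long
```

On those numbers, a single Sora frame carries roughly eight times the pixels of a Lumiere frame, and a maximum-length Sora clip runs 12 times as long.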

Additionally, Lumiere cannot make videos composed of multiple shots, while Sora can. Sora, like other models, is also reportedly capable of video-editing tasks such as creating videos from images or other videos, combining elements from different videos, and extending videos in time.

“In the competition between OpenAI’s Sora and startups like Runway AI, maturity may offer advantages in terms of reliability and scalability,” Marr said. “While startups often bring innovative approaches and agility, OpenAI, with large funding from companies like Microsoft, will be able to catch up and potentially overtake quickly.”