AI video generators cannot understand the laws of physics just by watching videos, scientists find.
AI video generators come on the heels of chatbots and image generators; Sora and Track are already delivering impressive results. But a team of scientists from Bytedance Research, Tsinghua University and Technion was curious whether such models can discover physical laws from visual data without any additional human input.
While in the real world we express physics through mathematics, in the world of video generation an AI model that understands physics should be able to look at a series of frames and predict which frames come next. This should hold both for scenarios the model has seen before and for unfamiliar ones.
To find out whether this understanding exists, the scientists built a 2D simulation of simple shapes and movements and generated hundreds of thousands of mini videos on which to train and test their models. They found that the models could 'mimic' physics, but did not understand it.
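As a rough illustration of how such a synthetic dataset can be built (this is a sketch, not the paper's actual pipeline), a clip of this kind can be produced by stepping a simple physics rule forward and rasterizing each state into a frame. The helper names, grid size and clip length below are made-up choices:

```python
import numpy as np

def render_ball(x, y, size=64, radius=4):
    """Rasterize one frame: a white ball on a black grid (resolution is an illustrative choice)."""
    yy, xx = np.mgrid[0:size, 0:size]
    return ((xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2).astype(np.float32)

def make_clip(x0, y0, vx, vy, n_frames=32):
    """Generate one mini video of a ball in uniform linear motion."""
    frames, x, y = [], x0, y0
    for _ in range(n_frames):
        frames.append(render_ball(x, y))
        x, y = x + vx, y + vy          # constant velocity, no forces acting on the ball
    return np.stack(frames)            # shape: (n_frames, size, size)

# Sampling many random initial positions and velocities yields a large training set.
clip = make_clip(x0=10.0, y0=32.0, vx=1.5, vy=0.0)
```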
Video: "Is SORA really a world model?" (YouTube)
The three fundamental physical laws they chose to simulate were the uniform linear motion of a ball, the perfectly elastic collision between two balls, and the parabolic motion of a ball.
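For reference, the textbook forms of these three laws (standard kinematics, not taken from the paper) are:

```latex
% Uniform linear motion (no forces):
x(t) = x_0 + v\,t
% Perfectly elastic collision of two balls (1D, masses m_1, m_2, incoming velocities u_1, u_2):
v_1 = \frac{(m_1 - m_2)\,u_1 + 2 m_2 u_2}{m_1 + m_2}, \qquad
v_2 = \frac{(m_2 - m_1)\,u_2 + 2 m_1 u_1}{m_1 + m_2}
% Parabolic (projectile) motion under gravity g:
x(t) = x_0 + v_{x0}\,t, \qquad y(t) = y_0 + v_{y0}\,t - \tfrac{1}{2} g t^2
```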
According to the team's preprint paper, the models performed as expected in simulations drawn from the data they were trained on, but did not do well in new, unforeseen scenarios. At best, the models tried to mimic the closest training example they could find.
During their experiments, the scientists also noticed that the video generator often changed one shape into another (for example, a square randomly turning into a ball) or made other nonsensical adjustments. The model's priorities seemed to follow a clear hierarchy: color carried the most weight, followed by size and then speed, while shape received the least emphasis.
Have they found a solution?
“It is challenging to determine whether a video model has learned a law rather than just memorized the data,” the researchers said. They explained that since the model’s internal knowledge is inaccessible, they could only derive the model’s understanding by examining its predictions on unseen scenarios.
“Our in-depth analysis suggests that generalization of video models depends more on referencing similar training examples than on learning universal rules,” they said, emphasizing that this happens regardless of the amount of data a model trains on.
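A minimal sketch of that kind of probe, assuming a hypothetical model.predict_next_frames interface and reusing the make_clip helper from the earlier sketch; the velocity ranges for the in-distribution and out-of-distribution splits are illustrative, not the paper's:

```python
import numpy as np

# Illustrative split: training clips use velocities 1-3, test clips use velocities never seen in training.
train_velocities = np.random.uniform(1.0, 3.0, size=1000)   # in-distribution
ood_velocities   = np.random.uniform(4.0, 6.0, size=100)    # out-of-distribution

def evaluate(model, velocities, n_context=8, n_future=8):
    """Mean squared error between predicted frames and the ground-truth simulation."""
    errors = []
    for v in velocities:
        clip = make_clip(x0=10.0, y0=32.0, vx=v, vy=0.0)                # ground truth from the simulator
        pred = model.predict_next_frames(clip[:n_context], n_future)    # hypothetical API
        errors.append(np.mean((pred - clip[n_context:n_context + n_future]) ** 2))
    return float(np.mean(errors))

# A model that has learned the law should score similarly on both splits;
# one that memorizes training clips will degrade sharply on the unseen velocities.
```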
Have they found a solution? Not yet, lead author Bingyi Kang wrote on X. "Actually, this is probably the mission of the entire AI community," he added.