Veo3 vs Wan2.2 vs Sora2: Zero-Shot Video Generation Comparison

Oct 26, 2025

TL;DR

I read about an interesting paper on Simon Willison's blog, Video models are zero-shot learners and reasoners, in which the researchers ran a battery of tests using the Veo3 model. Their conclusion: video generation models can act as zero-shot vision foundation models. Think GPT-3, but for vision. We ran additional experiments with Veo3, Wan2.2, and Sora2 to see whether this emergent behaviour is unique to Veo3 or something broader. It turns out all three models show impressive capabilities on perception, modelling, and manipulation tasks, but Veo3 consistently outperforms the others on reasoning tasks. Whether that's down to the model itself or to the Gemini-2.5-pro prompt rewriter remains an open question.

Introduction

If you've ever wondered how those AI-generated videos on social media come to life, here's the quick version: video generation models take a piece of text (sometimes paired with an image) and generate short videos, typically under 10 seconds, based on that input.

The architecture behind these models is fascinating. They're built on diffusion-based systems, which are essentially an extension of what powers image generation models like Stable Diffusion. The key difference? A 3D convolutional VAE that encodes not just a single image but multiple frames from a video - ensuring that the temporal information of videos can be compressed in the latent representation. In other words, it understands how things change over time, not just how they look in a single frame.
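
To make the shape bookkeeping concrete, here is a minimal sketch of such an encoder (my own illustration in PyTorch-style code, not taken from any of these models), showing how a stack of 3D convolutions shrinks a clip along both the spatial and the temporal axes:

```python
import torch
import torch.nn as nn

# Minimal sketch of a 3D-convolutional VAE encoder (illustrative only;
# real video VAEs use many more layers, attention, and a matching decoder).
class TinyVideoEncoder(nn.Module):
    def __init__(self, latent_channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            # input is (batch, 3, frames, height, width); stride 2 halves space and time
            nn.Conv3d(3, 64, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, latent_channels, kernel_size=3, stride=(2, 2, 2), padding=1),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        return self.net(video)

clip = torch.randn(1, 3, 16, 256, 256)   # a 16-frame RGB clip
latent = TinyVideoEncoder()(clip)
print(latent.shape)                       # (1, 4, 4, 64, 64): time and space both compressed
```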

Here's how it works: typical diffusion models start with a noisy image and, in a step-wise fashion, remove the noise to generate a clean, coherent image. During this denoising process, you can add conditional elements that influence the outcome. For example, if you provide a text prompt, the model removes noise in a way that ensures the final image aligns with that prompt. Video generation models take this concept further. Instead of conditioning on just a text prompt, they also factor in a starting image and previous frames in the video sequence.

Picture it like this: you start with a noisy image, a text prompt, and maybe an initial image. The model generates the first frame by denoising in the style dictated by the text and image. That becomes frame one. For frame two, the model repeats the denoising process, but now it conditions on the text prompt and the frame it just generated. This continues for every subsequent frame. Each new frame is influenced by the one before it, creating a coherent sequence that flows naturally from start to finish.
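
As a rough sketch of the process just described (my own illustrative pseudocode, not any model's actual implementation), the pieces fit together something like this. `denoiser` stands in for the diffusion backbone and `text_embedding` for a text encoder's output; real systems use proper noise schedules and often denoise all frames jointly in latent space rather than strictly one after another:

```python
import torch

def denoise_frame(denoiser, text_embedding, prev_frame=None, steps=50, shape=(1, 4, 64, 64)):
    """Crude conditional denoising loop for one frame (illustrative only).

    The hypothetical `denoiser` predicts the noise given the current latent,
    the timestep, the text embedding, and optionally the previous frame.
    """
    x = torch.randn(shape)                           # start from pure noise
    for t in reversed(range(steps)):
        predicted_noise = denoiser(x, t, text_embedding, prev_frame)
        x = x - predicted_noise / steps              # simplified noise-removal step
    return x

def generate_clip(denoiser, text_embedding, first_frame=None, num_frames=16):
    """Chain frames together: each new frame is conditioned on the one before it."""
    frames = [] if first_frame is None else [first_frame]
    for _ in range(num_frames):
        prev = frames[-1] if frames else None        # condition on the latest frame
        frames.append(denoise_frame(denoiser, text_embedding, prev))
    return frames                                    # a coherent sequence, frame by frame
```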

It's a clever approach that allows these models to generate videos that feel cohesive and, in many cases, surprisingly realistic.

Paper Review

The Veo3 paper puts forward a revolutionary idea: although the model was never explicitly trained to perform vision-based tasks like image segmentation, it somehow manages to pull them off. Remember, as we discussed in the introduction, this model is trained to predict the next frame in a sequence through a denoising process. Yet, the researchers found it could perform several tasks like image segmentation without any specific training for these capabilities. They draw a compelling parallel to GPT-3 for vision.

I remember this comparison vividly because I used GPT-3 extensively to perform various NLP tasks like Named Entity Recognition, even though the model itself was trained only to predict the next token. The revolutionary finding back then was that even though GPT-3 wasn't explicitly trained to do tasks like summarisation or Named Entity Recognition, you could achieve these results just by prompting it correctly. In the same way, Veo3 is only trained to predict the next frame, but by prompting it appropriately, you can use it to perform image segmentation, simulate physics, and more. This indeed seems amazing.

But I wanted to take a more critical look as well. Is there any other explanation for why the model can perform these tasks? Could there be data leakage?

One explanation that came to mind is synthetic data. When this model was trained, it's likely that a significant amount of synthetic data was used. And it's relatively easy to generate synthetic data from a single image by performing multiple perturbations and combining them into a sequence. For example, take a single image and apply varying degrees of opacity from 1% to 98% to it. Combine all these individual pictures to form a video and attach a text prompt like "The image becomes slowly brighter from a completely dark image." There could be countless such examples that the video generation model has seen during training, which might explain why it's able to perform these tasks.
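
To show how cheap such data is to fabricate, here is a small script (my own toy example, not a claim about any model's actual training set) that turns a single image into a brightness-ramp clip plus a matching caption:

```python
import numpy as np
from PIL import Image

def brightness_ramp_clip(image_path: str, num_frames: int = 48):
    """Create synthetic frames where one image fades in from near-black.

    Purely illustrative: one plausible recipe for the perturbation-based
    synthetic data described above.
    """
    base = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.float32)
    frames = []
    for i in range(num_frames):
        alpha = (i + 1) / num_frames                 # ramps from ~2% up to 100%
        frames.append((base * alpha).astype(np.uint8))
    caption = "The image becomes slowly brighter from a completely dark image."
    return frames, caption
```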

That said, I believe that while synthetic data could explain some of the results, it doesn't explain everything. For example, the model correctly simulates how different objects fall on Earth versus the Moon. It's harder to think of specific synthetic training data that would teach this behaviour. The results also show some level of intelligence in geometry, path-finding, and other areas that are harder to explain with synthetic examples alone. It's more likely that this behaviour or knowledge is learned by actually watching many videos of real-world phenomena.

I also wondered whether the chain of frames analogy is too convenient. The authors hypothesise that just like we have chain of thought in the case of LLMs, video models likely follow a "chain of frames" which enables this reasoning behaviour. In the case of text, chain of thought and reasoning traces are a way to display (via text) how to think through a problem. Maybe you could say the same about chain of frames; it's a way to show the model, not via text reasoning traces but via a sequence of images, what actually happens and how physics works. This seems like a compelling way to think about it.

I also appreciate how the authors provide examples of where the intelligence of this model fails. These are well laid out in the appendix and help demonstrate that not everything is perfect. It's refreshing to see this level of transparency in research.

Further Experiments

The question I wanted to answer was this: is this zero-shot behaviour unique to Veo3, or do other video generation models show it too? When using Veo3, there's a prompt rewriter that's part of the system, so it's unclear how much influence that has and how much intelligence can be attributed to the video model alone. If we observe similar behaviour in other video generation models (which might also include varying degrees of prompt rewriting, but typically less involved), then we can be more confident that this is a general emergent phenomenon.

Before diving into the results, I should note a few caveats.

  1. We didn't follow the rigorous process of 12 prompts per task that the paper authors used. This was mostly due to cost and time constraints.

  2. In some cases, we also couldn't generate Sora2 output for reasons like API failures.

  3. We generated shorter videos for Wan2.2 and Sora2 compared to the longer 8-second videos for Veo3, again mostly due to cost constraints.

You can find the consolidated results of all experiments on this page.

Perception

I found that all the models performed well on perception tasks. Wan2.2 and Sora2 matched Veo3 in executing tasks like image segmentation, de-blurring, super-resolution, and edge detection. What surprised me was that Wan2.2 actually performed better than Sora2 on more complicated perception tasks like conjunctive search, the binding problem, and the Rorschach blot interpretation. However, Wan failed to correctly interpret the Dalmatian illusion, where the other models succeeded.

Modelling

This is where things start to get fuzzier. All three models still got a lot right: physics body transforms, the order of objects (Visual Jenga), colour mixing, and material optics. The last one, material optics, was particularly impressive to me. All models did a really good job, though at varying levels of aesthetics: they all understand how the same room reflects off a mirrored sphere versus a glass sphere. Retaining memory of world states and character recognition were two examples where I found Wan2.2 matched Veo3's results.

Things start to go wrong a bit when it comes to draping a scarf on a vase. Here, both Wan2.2 and Sora2 make mistakes or produce results that look implausible. Similarly, when testing buoyancy patterns, I noticed some hallucinations, although the splash patterns in the water looked convincing.

Manipulation

In this category too, I found that Wan2.2 mostly matches up with Veo3. Sora2 particularly falls behind in some cases, for example in text manipulation tasks. One category of videos to do with robot hands and manipulation is where Veo3 really shines. It's able to get dexterous variations quite spot-on while Wan2.2 and Sora2 models fail. This is clearly visible when getting robot hands to open a jar or throw a ball. My hypothesis is that Veo3 was trained on multiple robotic manipulation videos, which could explain this advantage.

A particularly interesting example is the rolling of a burrito, where each model follows a slightly different approach, but one can see that the final burrito is rolled and ready to eat in all cases!

Reasoning

This is where I felt Veo3 has a distinct advantage. It gets a lot of the examples in this category right, particularly difficult ones like maze solving. Both Wan2.2 and Sora2 fail on most tasks here. I don't know how much of Veo3's performance is due to the prompt rewriter component that uses Gemini-2.5-pro as the LLM. As with the robot-hand examples above, Veo3 gets robot navigation spot-on, lending further credibility to the hypothesis about the distribution of training videos in the robotic domain.

One example that Wan2.2 and Sora2 get right is whether a golf ball fits into a vase. What's particularly funny is the audio in these videos, where you can hear disappointed reactions like "Oh no, no chance," which confirms that the task is not possible. Veo3, however, forces the golf ball through the vase, which is of course implausible.

Overall Conclusion

Based on these experiments, we can clearly see that video generation models do seem like zero-shot learners, though not to the same extent across all tasks. This lends credence to the comparison with GPT-3, because I remember that at that time too, we had to try multiple prompt variations to get reliable responses, and final selections were based on majority voting. These issues were eventually resolved in later versions through techniques like instruction fine-tuning and RLHF. I wonder what techniques will be incorporated into the next versions of video generation models to improve their consistency and reliability.

Based on these experiments, it also becomes clear that this behaviour is not specific to just Veo3. Both Wan2.2 and Sora2 show similar capabilities, which suggests this is an emergent property of video generation models in general, rather than something unique to Veo3's architecture.

While all models perform similarly on perception, modelling, and manipulation tasks, Veo3 does particularly well in reasoning, where the other two fail. Although the authors try various ways to check for bias, I suspect that having Gemini-2.5-pro as the prompt rewriter plays a real role in this superior performance. It's hard to separate how much of the reasoning capability comes from the video model itself and how much comes from having a sophisticated language model rewrite the prompts before they're fed into the system.

That said, the fact that all these models show zero-shot capabilities across multiple domains is exciting. It suggests we're seeing the early stages of video models developing a more general understanding of the world, similar to what happened with large language models. The question now is: how do we push these capabilities further, and how do we make them more reliable and consistent across different types of tasks?