DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation

IEIT · Nankai University · Tsinghua University

Abstract

Figure: Examples of integral spatio-temporal consistency.

Spatio-temporal consistency is a critical research topic in video generation. A qualified generated video segment must ensure plot plausibility and coherence while maintaining visual consistency of objects and scenes across varying viewpoints. Prior research, especially in open-source projects, primarily focuses on either temporal or spatial consistency, or on their basic combination, such as appending a camera-movement description to a prompt without constraining the outcomes of that movement. However, camera movement may introduce new objects to the scene or eliminate existing ones, thereby overlaying and affecting the preceding narrative. Especially in videos with numerous camera movements, the interplay between multiple plots becomes increasingly complex. This paper introduces and examines integral spatio-temporal consistency, considering the synergy between plot progression and camera techniques, as well as the long-term impact of prior content on subsequent generation. Our work spans dataset construction through model development. We first constructed DropletVideo-10M, a dataset of 10 million videos featuring dynamic camera motion and object actions. Each video is annotated with a caption averaging 206 words, detailing the camera movements and plot developments. We then developed and trained the DropletVideo model, which excels at preserving spatio-temporal coherence during video generation.





DropletVideo-10M Dataset

Panda-70M Caption: A person is holding a long haired dachshund in their arms.
DropletVideo-10M Caption: This video captures a scene of a man walking on a city street at night. The lighting is dim, but the background streets and buildings remain clearly visible.
The video begins on a nighttime city street, where a man wearing a T-shirt with a colorful pattern and a clip-on microphone appears in front of the camera. His face is blurred. In the background, there are shop windows displaying colorful merchandise, and across the street, there is a roadway with vehicles moving slowly. Streetlights and headlights provide faint illumination to the street.
As the man walks while facing the camera, more details of the buildings in the background become visible. A blue sedan passes by on the street, and the shadows of the vehicles flicker on the ground under the lights.
Then, the camera pans to the right, revealing a new scene. Another man wearing a black T-shirt enters the frame, walking near the entrance of a store that emits a bright white light from above. At the same time, pedestrians on both sides of the street come into view, and their shadows on the ground become more distinct.
As the scene transitions, the camera captures a brightly lit urban district with heavy traffic. A blue SUV is seen queued behind a silver car as vehicles move forward slowly. At this moment, the main subject is shown from behind, walking along a crowded sidewalk. The background consists of trees and building facades adorned with green plants inside the walls.
Following the pedestrian’s movement, the camera continues along the street, where traffic remains steady. There are many parked cars along the roadside, including a black sedan.
Towards the end of the video, the man continues walking along the same sidewalk. The background features a row of shops, with customers lingering outside and chatting. The surroundings remain lively with the bustling city atmosphere under the night sky.
Finally, the camera pulls back towards the side of the street, showing the opposite side still busy with traffic and the flashing city lights.

The DropletVideo-10M dataset features diverse camera movements, long context-rich captions, and strong spatio-temporal consistency. Existing datasets such as Panda-70M place less emphasis on camera movement and contain relatively brief captions. In contrast, DropletVideo-10M consists of spatio-temporal videos that capture both camera movement and event progression. Each video is paired with a caption conveying detailed spatio-temporal information aligned with the video content, with an average caption length of 206 words. The spatio-temporal information is highlighted in red in the figure.
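To make the annotation format concrete, the record below sketches one plausible layout for a DropletVideo-10M sample; the field names and values are illustrative assumptions, not the released schema:

# One plausible DropletVideo-10M record (field names and values are
# illustrative assumptions; the released dataset may use a different schema).
sample = {
    "video_path": "clips/000123.mp4",               # hypothetical path
    "caption": "This video captures a scene of a man walking on a city "
               "street at night. ... Then, the camera pans to the right, "
               "revealing a new scene. ...",        # ~206 words on average
    "camera_motions": ["pan right", "pull back"],   # movements named in the caption
    "motion_intensity": 1.4,                        # scalar M steering adaptive sampling
}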





DropletVideo Method Overview

The video is processed by the 3D causal Variational Autoencoder (VAE) after adaptive equalization sampling, which is steered by the motion intensity M. The resulting video feature x_v is then fed into the Modality-Expert Transformer, depicted on the right side of the figure, to drive video generation together with the text encoding x_t and the combined encoding x_{T&M} of the temporal information T and the motion intensity M. The upper-left part contrasts (a) the traditional sampling approach with (b) DropletVideo's adaptive equalization sampling: traditional methods extract a random segment and sample it at a fixed frame rate, whereas DropletVideo applies adaptive-frame-rate sampling across the entire video, guided by M.
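As a rough illustration of the sampling difference, here is a minimal Python sketch of adaptive, whole-video sampling; the motion-intensity proxy below is an assumption, since the exact definition of M in DropletVideo may differ:

import numpy as np

def adaptive_equalization_sampling(total_frames: int, num_frames: int, native_fps: float):
    """Sample num_frames evenly across the whole video instead of cutting
    a random fixed-fps clip, so the sampled clip always covers the full
    plot (illustrative sketch, not the released code)."""
    indices = np.linspace(0, total_frames - 1, num_frames).round().astype(int)
    duration_s = total_frames / native_fps         # duration of the entire video
    effective_fps = num_frames / duration_s        # adaptive sampling rate actually used
    motion_intensity = native_fps / effective_fps  # assumed proxy for M: higher = more motion per frame
    return indices, effective_fps, motion_intensity

# Example: a 10 s video at 30 fps compressed into 49 sampled frames
idx, fps_eff, M = adaptive_equalization_sampling(300, 49, 30.0)
print(len(idx), round(fps_eff, 2), round(M, 2))    # 49 4.9 6.12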





Integral Spatio-temporal Consistency

DropletVideo focuses on integral spatio-temporal consistency during video generation. It addresses the spatial distortion caused by camera movement, ensuring smooth plot progression throughout camera moves and spatio-temporal consistency of objects within the scene. More importantly, as a video scenario develops, newly emerging scenes do not alter the behavior of objects already established in the video.





High Controllability of Emerging Objects









3D Consistency

Trained on the large-scale spatio-temporal dataset DropletVideo-10M, DropletVideo exhibits remarkable 3D consistency. In the first example, the camera rotates around a snowflake, maintaining strict consistency for both the background and the snowflake across viewing angles, while preserving the snowflake's intricate details from multiple perspectives. In the bottom example, the camera performs an arc shot, keeping the same subject in frame. Despite not being specifically designed for arc shots, DropletVideo effectively maintains the insect's 3D consistency over a broad range of rotation angles, demonstrating robust spatial 3D continuity.






Controllable Motion Density

DropletVideo manages the pace of plot advancement and camera movement by adjusting a motion control parameter. In the example shown, increasing this parameter allows a video of the same length to incorporate more plot elements. The videos above show generation results under varying motion control parameters with identical text-image input. Under the fps=8 setting, the camera motion is noticeably larger than under fps=12 or fps=16, and the snowflake is viewed over a much wider range of perspectives. Motion density decreases as fps increases from 8 to 16, confirming that a smaller fps yields a video with more pronounced camera variation. This indicates that DropletVideo can proficiently modulate the playback pace of the content while preserving semantic precision.
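The trade-off follows from simple arithmetic: with a fixed number of generated frames, a lower conditioning fps spans more wall-clock content per clip. A short sketch, where the 49-frame budget is an assumed value for illustration:

FRAME_BUDGET = 49  # assumed number of generated frames per clip
for fps in (8, 12, 16):
    span_s = FRAME_BUDGET / fps  # seconds of content covered by one clip
    print(f"fps={fps:2d} -> covers {span_s:.1f} s of content")
# fps= 8 -> covers 6.1 s of content (largest camera/plot motion per clip)
# fps=12 -> covers 4.1 s of content
# fps=16 -> covers 3.1 s of content (smallest)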





Camera Motion

DropletVideo demonstrates versatile camera motion generation across the fundamental movement types. It produces cinema-standard motions including left/right trucking, vertical pedestal movement, tilt adjustment, axial dollying, and composite pan-tilt operations.
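In practice, such motions are elicited through phrases in the free-form caption rather than a fixed command set; the mapping below is a hypothetical illustration of prompt phrasing, not an official interface:

# Hypothetical caption phrases for eliciting each motion type (illustrative;
# DropletVideo is conditioned on free-form captions, not a fixed vocabulary).
CAMERA_PROMPT_PHRASES = {
    "truck_right": "the camera trucks to the right along the street",
    "pedestal_up": "the camera rises vertically, revealing the rooftops",
    "tilt_down":   "the camera tilts downward toward the ground",
    "dolly_in":    "the camera dollies in toward the subject",
    "pan_tilt":    "the camera pans right while tilting slightly upward",
}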










Comparison of DropletVideo with Existing Models

To better demonstrate the integral spatio-temporal consistency of DropletVideo, we selected several industry-recognized video generation models for comparison, including Hailuo, Kling v1.6, Gen-3, Vidu, Vivago, Qingying, CogVideoX-Fun, and WanX. Among the compared models, only CogVideoX-Fun and WanX are open-source, as is our approach; the remaining models are closed-source. We conducted comparisons using examples from the scenarios mentioned earlier: boat, kitchen, lake, snow, staircase, and sunset.


Example 1: boat


Example 2: kitchen


Example 3: lake


Example 4: snow


Example 5: staircase


Example 6: sunset

BibTeX

@article{zhang2025dropletvideo,
        title={DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation},
        author={Zhang, Runze and Du, Guoguang and Li, Xiaochuan and Jia, Qi and Jin, Liang and Liu, Lu and Wang, Jingjing and Xu, Cong and Guo, Zhenhua and Zhao, Yaqian and Gong, Xiaoli and Li, Rengang and Fan, Baoyu},
        journal={arXiv preprint arXiv:2503.06053},
        year={2025}
      }