Spatio-temporal consistency is a critical research topic in video generation. A well-formed generated video segment must ensure plot plausibility and coherence while maintaining visual consistency of objects and scenes across varying viewpoints. Prior research, especially in open-source projects, primarily focuses on either temporal or spatial consistency, or their basic combination, such as appending a description of a camera movement after a prompt without constraining the outcomes of this movement. However, camera movement may introduce new objects to the scene or eliminate existing ones, thereby overlaying and affecting the preceding narrative. Especially in videos with numerous camera movements, the interplay between multiple plots becomes increasingly complex. This paper introduces and examines integral spatio-temporal consistency, considering the synergy between plot progression and camera techniques, and the long-term impact of prior content on subsequent generation. Our research spans dataset construction through model development. We first constructed the DropletVideo-10M dataset, which comprises 10 million videos featuring dynamic camera motion and object actions. Each video is annotated with a caption averaging 206 words, detailing various camera movements and plot developments. We then developed and trained the DropletVideo model, which excels in preserving spatio-temporal coherence during video generation.
The DropletVideo-10M dataset features diverse camera movements, long-captioned contextual descriptions, and strong spatio-temporal consistency. Existing datasets, such as Panda-70M, place less emphasis on camera movement and contain relatively brief captions. In contrast, DropletVideo-10M consists of spatio-temporal videos that incorporate both camera movement and event progression. Each video is paired with a caption that conveys detailed spatio-temporal information aligned with the video content, with an average caption length of 206 words. The spatio-temporal information is highlighted in red in the figure.
The video is processed by the 3D causal Variational Autoencoder (VAE) following adaptive equalization sampling, which is steered by the motion intensity M. The video feature x_v is then input into the Modality-Expert Transformer, depicted on the right side of the figure, to drive video generation together with the text encoding x_t and the combined encoding x_{T&M} of the temporal term T and the motion intensity M. The upper left part contrasts (a) the traditional sampling approach with (b) DropletVideo's adaptive equalization sampling. Traditional methods cut a random segment from the video and then sample that segment at a fixed frame rate, whereas DropletVideo samples across the entire video segment at an adaptive frame rate, guided by M.
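The contrast between the two sampling schemes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DropletVideo implementation: the function names, the even spacing of frames, and the reading of M as the effective sampling rate (sampled frames per second of source footage) are all hypothetical choices made for the example.

```python
import random


def fixed_rate_segment_sample(total_frames, src_fps, num_frames, target_fps, rng):
    """Traditional scheme: cut a random segment, then sample it at a fixed frame rate."""
    stride = max(1, round(src_fps / target_fps))   # source frames skipped per sampled frame
    span = (num_frames - 1) * stride + 1           # source frames covered by the segment
    if span > total_frames:
        raise ValueError("video too short for the requested segment")
    start = rng.randrange(total_frames - span + 1)  # random segment start
    return [start + i * stride for i in range(num_frames)]


def adaptive_whole_clip_sample(total_frames, src_fps, num_frames):
    """Assumed adaptive scheme: spread num_frames evenly over the whole clip.

    Returns the frame indices and the effective sampling rate, which this
    sketch treats as the motion intensity M (an assumption).
    """
    if num_frames < 2 or total_frames < num_frames:
        raise ValueError("need at least 2 frames and a clip at least that long")
    step = (total_frames - 1) / (num_frames - 1)    # fractional stride over the full clip
    indices = [round(i * step) for i in range(num_frames)]
    duration_s = total_frames / src_fps
    motion_intensity = num_frames / duration_s      # sampled frames per source second
    return indices, motion_intensity


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical clip: 10 seconds at 24 fps, 40 frames fed to the model.
    print(fixed_rate_segment_sample(240, 24, 40, 8, rng)[:5])
    indices, m = adaptive_whole_clip_sample(240, 24, 40)
    print(indices[:5], indices[-1], m)
```

In this sketch the fixed-rate sampler only ever sees part of the clip, while the adaptive sampler always spans it from first to last frame and exposes the resulting rate as a conditioning value.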
DropletVideo focuses on integral spatio-temporal consistency during video generation. It addresses the spatial distortion issues caused by camera movement, ensuring smooth plot progression during camera movement and the spatio-temporal consistency of objects within the scene. More importantly, in the development of a video scenario, the emerging scenes do not affect the behavior of the original video objects.
The video shows a pair of small boats floating peacefully on a tranquil lake, with a magnificent sunset sky as the backdrop. The boat on the right is slowly chasing the boat on the left, with a soft golden glow reflecting the afterglow of the setting sun. The camera slowly moves from right to left, gradually revealing more background details. The distant city skyline appears hazy and dreamlike under the sunset, with a few tall buildings faintly visible. On the left side of the frame, tree branches sway gently in the breeze, adding a touch of natural movement to the scene. As the camera continues to move left, another small boat is shown quietly moored on the water on the left side, contrasting sharply with the distant city buildings.
The video presents a serene and beautiful sunset scene, capturing a flock of birds soaring gracefully under the evening sun, creating a stunning visual. The sun is slowly descending towards the horizon, painting the entire sky in warm shades of orange and red. The clouds, illuminated by the sunset, glow in golden hues, adding to the magnificent scenery. At the center of the frame stands a solitary tree, its branches appearing particularly distinct against the backdrop of the setting sun. As the camera moves slowly, a rolling grassland gradually emerges on the left side of the frame. The grassland, bathed in the sunset’s afterglow, displays varying shades of light and shadow, adding a rhythmic natural beauty to the scene. As the camera continues to pan left, the flight path of the birds becomes increasingly visible, forming a bright arc under the glow of the sunset and enhancing the dynamic beauty of the composition. Further along, another tree appears in the frame, its silhouette sharply defined under the warm hues of the setting sun, with crisp and well-defined lines.
The video showcases a chef focusing on the process of cooking in a modern kitchen, with professional kitchen equipment behind him and a clean and tranquil surrounding environment. At the start of the video, the chef is wearing a tall white chef's hat, a black chef's coat, and a white apron, standing in front of the central kitchen counter. The camera focuses on the chef's skillful hands as he uses a bright knife to chop various fresh ingredients on the worktable. These ingredients include red tomatoes, yellow peppers, green cucumbers, and a tall green cauliflower. The vegetables are colorful and neatly arranged. In the background, you can see the metal exhaust hood and several modern stainless-steel kitchen appliances. The kitchen is empty except for the chef, who is working attentively. As the video progresses, the camera slowly pans to the right, and a red apple gradually enters the frame, which is very fresh.
The video showcases a chef focusing on the cooking process in a modern kitchen. Behind the chef are professional kitchen equipment, with the surrounding environment clean and tranquil. At the beginning of the video, the chef is wearing a tall white chef's hat, a black chef's coat, and a white apron, standing in front of the central workbench in the kitchen. The camera focuses on the chef's dexterous hands as he uses a bright knife to chop various fresh ingredients on the workbench. These ingredients include red tomatoes, yellow peppers, green cucumbers, and a tall green cauliflower. The vegetables are brightly colored and neatly arranged. In the background, the metal hood and several modern stainless steel kitchen appliances can be seen. The kitchen is only occupied by the chef, who is working attentively. As the video progresses, the camera slowly pans to the right, and a red apple gradually enters the frame, with many droplets of water, indicating its freshness.
The video showcases a chef's focused cooking process in a modern kitchen, with professional kitchen equipment visible behind him and the surrounding environment clean and tranquil. At the beginning of the video, the chef is wearing a tall white chef's hat, a black chef's coat, and a white apron, standing in front of the central kitchen counter. The camera focuses on the chef's dexterous hands as he uses a bright kitchen knife to chop various fresh ingredients on the worktable, which includes red tomatoes, yellow peppers, green cucumbers, and a tall green cauliflower. The vegetables are brightly colored and neatly arranged. In the background, the metal hood and several modern stainless steel kitchen appliances can be seen, with the chef working alone. As the video progresses, the camera slowly moves to the right, and a red apple gradually enters the frame, showing slight signs of spoilage with brown spots.
The video showcases a chef focused on the process of cooking in a modern kitchen. Behind the chef are professional kitchen appliances, and the surrounding environment is clean and tranquil. At the beginning of the video, the chef is wearing a tall white chef's hat, a black chef's coat, and a white apron, standing in front of the central workbench in the kitchen. The camera focuses on the chef's dexterous hands as he uses a bright knife to chop various fresh ingredients on the workbench. These ingredients include red tomatoes, yellow peppers, green cucumbers, and a tall green cauliflower. The colors of the vegetables are vibrant and neatly arranged. In the background, one can see the kitchen's metal exhaust hood and several modern stainless steel kitchen appliances. The kitchen is occupied solely by the chef, who is working intently. As the video progresses, the camera slowly pans to the right, and a few yellow bananas gradually enter the frame on the workbench. The bananas have minor signs of spoilage with a few black spots.
Trained on the large-scale spatio-temporal dataset, DropletVideo-10M, DropletVideo exhibits remarkable 3D consistency. In the following example, the camera rotates around a snowflake, maintaining stringent consistency for both the background and the snowflake from various angles, while preserving the snowflake’s intricate details across multiple perspectives. In the bottom example, the camera performs an arc shot, projecting the same object. Despite not being specifically designed for arc shots, DropletVideo effectively maintains the insect’s 3D consistency over a broad range of rotation angles, demonstrating robust spatial 3D continuity.
A tranquil and beautiful snow scene, with a delicate glass snowflake placed in the center on soft snow. The background is a vast snowy plain dotted with pine trees, and the afterglow of the setting sun in the sky sprinkles a gentle glow. The video begins with the glass snowflake in the center of the frame, with sunlight passing through its transparent body, making it shine with colorful light. The snowflake's design is detailed, with clear edges and corners. The camera slowly rotates to the right around the snowflake, the distant pine trees are naturally distributed, appearing somewhat bent under the weight of the snow on the layered slopes. The camera continues to slowly rotate to the right and around the snowflake, another mountain view gradually comes into sight, with a few tall pine trees standing on the hilltop. On the horizon, the sun is about to set, and the remaining light turns the sky from light blue to warm orange. The camera continues to slowly rotate to the right around the snowflake, finally, the frame stays on the central glass snowflake, where the distant mountain top meets the horizon, and sunlight reflects on the snow.
A rotating process of an orange 3D model, which is a cartoon-style insect, possibly an ant. The entire model rotates in a black background, displaying details from all angles. At the start of the video, the model faces the audience, revealing two antennas, a body divided into the head, thorax, and abdomen, and two eyes on each antenna. Its forelimbs and hind limbs are both extended outward. As it rotates, the model gradually turns to the left, showing its side. At this point, its body structure can be seen more clearly, including details of the back and abdomen. Continuing to rotate, the model turns to the back, revealing its back and tail. The back has obvious segmentation, while the tail gradually narrows. Then, the model continues to rotate, showing its side and front, revealing more details of its forelimbs and hind limbs, including the joints and ends of the limbs. Finally, the model turns to the front again, displaying its front details, including the position of the head and antennas.
Motion Intensity = 8
Motion Intensity = 12
Motion Intensity = 16
DropletVideo manages the pace of plot advancement and camera angle shifts by adjusting a motion control parameter. In the example shown, raising this parameter lets a video of the same length incorporate more plot components. The videos above show generation results under different motion control parameters for an identical text-image input. Under the fps=8 setting, the camera motion is visibly larger than under fps=12 and fps=16, and the snowflake is seen across a wider range of viewpoints. Motion density decreases as fps increases from 8 to 16, which confirms that a smaller fps yields a video with more pronounced camera variation. This indicates that DropletVideo can proficiently modulate the playback speed of the content while preserving semantic precision.
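One way to see why a smaller fps gives larger camera motion is a short arithmetic sketch. It assumes the motion control value acts as the rate at which source footage is sampled, which is an interpretation of the fps=8, 12, 16 comparison rather than a confirmed detail; the frame count `NUM_FRAMES` and the helper `covered_seconds` are hypothetical.

```python
def covered_seconds(num_frames: int, sampling_fps: float) -> float:
    """Seconds of source footage spanned by num_frames sampled at sampling_fps."""
    if num_frames <= 0 or sampling_fps <= 0:
        raise ValueError("num_frames and sampling_fps must be positive")
    return num_frames / sampling_fps


NUM_FRAMES = 48  # hypothetical output length, chosen only for the illustration

for fps in (8, 12, 16):
    print(f"fps={fps}: {covered_seconds(NUM_FRAMES, fps):.1f} s of source footage")
# fps=8: 6.0 s of source footage
# fps=12: 4.0 s of source footage
# fps=16: 3.0 s of source footage
```

With the output length fixed, the fps=8 setting spans twice as much source time as fps=16, so the same number of frames has to cover twice as much camera travel.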
DropletVideo demonstrates versatile camera motion generation capabilities including various fundamental movement types. The system produces cinema-standard motions including right/left trucking, vertical pedestal movement, tilt adjustment, axial dollying, and composite pan-tilt operations.
Camera Truck left. The video showcases a white off-road vehicle parked by the riverside, creating an atmosphere of outdoor adventure and harmony with nature. In the initial frame, the front of the white off-road vehicle occupies the right side of the frame, with its body covered in noticeable mud stains, emphasizing its rugged journey. Across the river, a dense, deep-green forest stretches out, with sunlit leaves brimming with vitality. As the camera slowly moves left around the vehicle, the view gradually expands, revealing that the vehicle is parked on a textured riverbank. The camera continues its leftward movement, unveiling a broader view of the river, where scattered stones dot the water’s surface. Sunlight dances on the rippling water, while the forest on the opposite bank gradually comes into view. The entire scene appears bright and layered, enhancing the sense of depth and natural beauty.
Camera Truck right. A macro shot captures two dewy green leaves on a rainy early morning. In the opening frame, the sharp-edged leaf on the left has crystal-clear droplets trickling down its edge. The smooth-surfaced leaf on the right holds several still water droplets, subtly deformed by gravity. The background is softly blurred into a hazy green, with faint plant silhouettes visible. As the camera slowly pans to the right, soft light filters through the droplets, refracting into subtle colorful halos. The focus naturally shifts from the flowing droplets on the left leaf to the center of the right leaf, where several crystal-clear droplets are neatly aligned along the main vein. As the movement continues, the left leaf’s edge gradually fades out of the frame, while the right leaf’s intricate texture becomes more defined, revealing delicate reflections within the water droplets.
Camera Pedestal down. A green plant, at the beginning, the camera focuses on its unopened tender leaves. These tender leaves present a deep green color, with a smooth surface, and the leaves are tightly rolled together, forming a spiral shape. The background is blurred, making the plant the focus of the frame. As the video progresses, the camera slowly moves from top to bottom, gradually revealing more details of the plant, while the background blur effect is also changing. The camera continues to move from top to bottom, revealing more details of the plant.
Camera Tilt up. The video showcases an elegant indoor spiral staircase. The initial frame is a static wide-angle shot, clearly presenting the staircase’s structure: vibrant red carpeting covers the steps, while both sides feature intricately designed wrought iron railings with graceful curves. The staircase spirals upward, extending beyond the frame, with a sturdy wooden support column prominently visible, emphasizing its structural stability. Next, the camera smoothly moves upward along the staircase, tilting slightly to the left, making the red-carpeted steps appear taller while also highlighting the delicate ironwork patterns on the railings. The camera then continues its upward movement, gradually revealing the top section of the staircase, where soft wall lighting casts a warm and inviting ambiance. Toward the end, the camera settles at a mid-level perspective, capturing a slightly protruding white decorative element on the upper wall and a dark hanging light fixture at the top. The video concludes with this harmonious composition, emphasizing the staircase’s refined craftsmanship and architectural beauty.
Camera Dolly in. A cozy little house covered in snow, the camera begins from a distance, with the roof and ground covered in thick snow. In front of the house, there is a small balcony, and a thin layer of snow covers the railing of the balcony. As the camera advances, the striking red door at the house entrance becomes prominent, in front of the door is a snow-covered path leading to the steps. On both sides of the steps, there is a pine tree each, with some snow piled up on the trees. The camera continues to push forward, showing the windows of the house, with white curtains hanging on the windows, and snow accumulates on the windowsills. The camera continues to advance, giving a clearer view of the red front door.
Camera Pan right and Tilt up. A panoramic view of a tranquil lake, with clear water, surrounded by lush mountains and blue skies with white clouds. In the opening shot, the lake occupies most of the picture, with the sunlight shining on the lake forming a faint golden halo. The towering mountains on the left and the reflections of the trees are clearly visible in the lake, with green vegetation at the foot of the mountains surrounding the lakeshore. The camera slowly moves to the right, gradually revealing the more expansive lake in the distance and the mountains surrounding the lake. These mountains, under the reflection of the sunlight, have increasingly clear outlines, with thick snow covering the peaks, majestic and imposing. Continuing to move to the right, the silhouette of the distant mountains begins to faintly fade out, and the blue lake water stretches towards the distance, connecting with the more expansive sky. The sky is azure, with a few white clouds floating, adding dynamism and vitality to the entire scene. Finally, the camera slowly tilts upwards, capturing the more expansive sky and the magnificent view of the lake.
To better demonstrate the cumulative spatio-temporal consistency of DropletVideo, we selected several industry-recognized video generation models for comparison, including Hailuo, Kling v1.6, Gen-3, Vidu, Vivago, Qingying, CogVideoX-Fun, and WanX. Of the compared models, only CogVideoX-Fun and WanX are open-source, like our approach; the remaining models are closed-source. We conducted comparisons using examples from the scenarios shown earlier: boat, kitchen, lake, snow, staircase, and sunset.
The video shows a pair of small boats floating peacefully on a tranquil lake, with a magnificent sunset sky as the backdrop. The boat on the right is slowly chasing the boat on the left, with a soft golden glow reflecting the afterglow of the setting sun. The camera slowly moves from right to left, gradually revealing more background details. The distant city skyline appears hazy and dreamlike under the sunset, with a few tall buildings faintly visible. On the left side of the frame, tree branches sway gently in the breeze, adding a touch of natural movement to the scene. As the camera continues to move left, another small boat is shown quietly moored on the water on the left side, contrasting sharply with the distant city buildings.
DropletVideo (ours, 水滴)
Kling v1.6 (可灵)
Vidu 2.0
WanX 2.1 (万象)
Hailuo I2V-01-Live (海螺)
Qingying I2V 2.0 (清影)
HunyuanVideo (腾讯混元-Video, text-to-video)
Vivago
Gen3 Alpha Turbo
The video showcases a chef focusing on the process of cooking in a modern kitchen, with professional kitchen equipment behind him and a clean and tranquil surrounding environment. At the start of the video, the chef is wearing a tall white chef's hat, a black chef's coat, and a white apron, standing in front of the central kitchen counter. The camera focuses on the chef's skillful hands as he uses a bright knife to chop various fresh ingredients on the worktable. These ingredients include red tomatoes, yellow peppers, green cucumbers, and a tall green cauliflower. The vegetables are colorful and neatly arranged. In the background, you can see the metal exhaust hood and several modern stainless-steel kitchen appliances. The kitchen is empty except for the chef, who is working attentively. As the video progresses, the camera slowly pans to the right, and a red apple gradually enters the frame, which is very fresh.
DropletVideo (ours, 水滴)
Kling v1.6 (可灵)
Vidu 2.0
WanX 2.1 (万象)
Hailuo I2V-01-Live (海螺)
Qingying I2V 2.0 (清影)
HunyuanVideo (腾讯混元-Video, text-to-video)
Vivago
Gen3 Alpha Turbo
A panoramic view of a tranquil lake, with clear water, surrounded by lush mountains and blue skies with white clouds. In the opening shot, the lake occupies most of the picture, with the sunlight shining on the lake forming a faint golden halo. The towering mountains on the left and the reflections of the trees are clearly visible in the lake, with green vegetation at the foot of the mountains surrounding the lakeshore. The camera slowly moves to the right, gradually revealing the more expansive lake in the distance and the mountains surrounding the lake. These mountains, under the reflection of the sunlight, have increasingly clear outlines, with thick snow covering the peaks, majestic and imposing. Continuing to move to the right, the silhouette of the distant mountains begins to faintly fade out, and the blue lake water stretches towards the distance, connecting with the more expansive sky. The sky is azure, with a few white clouds floating, adding dynamism and vitality to the entire scene. Finally, the camera slowly tilts upwards, capturing the more expansive sky and the magnificent view of the lake.
DropletVideo (ours, 水滴)
Kling v1.6 (可灵)
Vidu 2.0
WanX 2.1 (万象)
Hailuo I2V-01-Live (海螺)
Qingying I2V 2.0 (清影)
HunyuanVideo (腾讯混元-Video, text-to-video)
Vivago
Gen3 Alpha Turbo
A tranquil and beautiful snow scene, with a delicate glass snowflake placed in the center on soft snow. The background is a vast snowy plain dotted with pine trees, and the afterglow of the setting sun in the sky sprinkles a gentle glow. The video begins with the glass snowflake in the center of the frame, with sunlight passing through its transparent body, making it shine with colorful light. The snowflake's design is detailed, with clear edges and corners. The camera slowly rotates to the right around the snowflake, the distant pine trees are naturally distributed, appearing somewhat bent under the weight of the snow on the layered slopes. The camera continues to slowly rotate to the right and around the snowflake, another mountain view gradually comes into sight, with a few tall pine trees standing on the hilltop. On the horizon, the sun is about to set, and the remaining light turns the sky from light blue to warm orange. The camera continues to slowly rotate to the right around the snowflake, finally, the frame stays on the central glass snowflake, where the distant mountain top meets the horizon, and sunlight reflects on the snow.
DropletVideo (ours, 水滴)
Kling v1.6 (可灵)
Vidu 2.0
WanX 2.1 (万象)
Hailuo I2V-01-Live (海螺)
Qingying I2V 2.0 (清影)
HunyuanVideo (腾讯混元-Video, text-to-video)
Vivago
Gen3 Alpha Turbo
The video showcases an elegant indoor spiral staircase. The initial frame is a static wide-angle shot, clearly presenting the staircase’s structure: vibrant red carpeting covers the steps, while both sides feature intricately designed wrought iron railings with graceful curves. The staircase spirals upward, extending beyond the frame, with a sturdy wooden support column prominently visible, emphasizing its structural stability. Next, the camera smoothly moves upward along the staircase, tilting slightly to the left, making the red-carpeted steps appear taller while also highlighting the delicate ironwork patterns on the railings. The camera then continues its upward movement, gradually revealing the top section of the staircase, where soft wall lighting casts a warm and inviting ambiance. Toward the end, the camera settles at a mid-level perspective, capturing a slightly protruding white decorative element on the upper wall and a dark hanging light fixture at the top. The video concludes with this harmonious composition, emphasizing the staircase’s refined craftsmanship and architectural beauty.
DropletVideo (ours, 水滴)
Kling v1.6 (可灵)
Vidu 2.0
WanX 2.1 (万象)
Hailuo I2V-01-Live (海螺)
Qingying I2V 2.0 (清影)
HunyuanVideo (腾讯混元-Video, text-to-video)
Vivago
Gen3 Alpha Turbo
The video presents a serene and beautiful sunset scene, capturing a flock of birds soaring gracefully under the evening sun, creating a stunning visual. The sun is slowly descending towards the horizon, painting the entire sky in warm shades of orange and red. The clouds, illuminated by the sunset, glow in golden hues, adding to the magnificent scenery. At the center of the frame stands a solitary tree, its branches appearing particularly distinct against the backdrop of the setting sun. As the camera moves slowly, a rolling grassland gradually emerges on the left side of the frame. The grassland, bathed in the sunset’s afterglow, displays varying shades of light and shadow, adding a rhythmic natural beauty to the scene. As the camera continues to pan left, the flight path of the birds becomes increasingly visible, forming a bright arc under the glow of the sunset and enhancing the dynamic beauty of the composition. Further along, another tree appears in the frame, its silhouette sharply defined under the warm hues of the setting sun, with crisp and well-defined lines.
DropletVideo (ours, 水滴)
Kling v1.6 (可灵)
Vidu 2.0
WanX 2.1 (万象)
Hailuo I2V-01-Live (海螺)
Qingying I2V 2.0 (清影)
HunyuanVideo (腾讯混元-Video, text-to-video)
Vivago
Gen3 Alpha Turbo
@article{zhang2025dropletvideo,
title={DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation},
author={Zhang, Runze and Du, Guoguang and Li, Xiaochuan and Jia, Qi and Jin, Liang and Liu, Lu and Wang, Jingjing and Xu, Cong and Guo, Zhenhua and Zhao, Yaqian and Gong, Xiaoli and Li, Rengang and Fan, Baoyu},
journal={arXiv preprint arXiv:2503.06053},
year={2025}
}