Commonsense Priors from Videos Facilitate 3D Generation
Xiaochuan Li1*, Guoguang Du1*, Runze Zhang1*, Liang Jin1*, Qi Jia1*, Lihua Lu1*, Zhenhua Guo1,
Yaqian Zhao1, Haiyang Liu, Tianqi Wang, Changsheng Li, Xiaoli Gong2, Rengang Li1,3†, Baoyu Fan1,2†
1IEIT, 2Nankai University, 3Tsinghua University | *Equal Contribution, †Corresponding Authors
Scaling laws have validated the success and promise of models trained on large datasets for creative generation across the text, image, and video domains. However, this paradigm faces data scarcity in the 3D domain, as far less 3D data is available on the internet than in the aforementioned modalities. Fortunately, there is an ample supply of videos that inherently contain commonsense priors, offering an alternative supervisory signal to mitigate the generalization bottleneck caused by limited native 3D data. On the one hand, videos capturing multiple views of an object or scene provide a spatial-consistency prior for 3D generation. On the other hand, the rich semantic information contained in videos enables the generated content to be more faithful to text prompts and semantically plausible. This paper explores how to apply the video modality to 3D asset generation, spanning from datasets to models. We introduce Droplet3D-4M, the first large-scale video dataset with multi-view-level annotations, and train Droplet3D, a generative model supporting both image and dense text input. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to produce spatially consistent and semantically plausible content. Moreover, in contrast to prevailing 3D solutions, our approach exhibits the potential for extension to scene-level applications. This indicates that commonsense priors from videos significantly facilitate 3D creation.


To take advantage of the commonsense priors acquired by video generation models trained on large-scale video data, and thereby enhance the generalization capability of 3D generative models, we construct a novel dataset, Droplet3D-4M, to bridge the gap between the video and 3D domains. Specifically, a qualifying 3D sample must satisfy two criteria. First, it should support dense, coherent multi-view sequences. This ensures compatibility with the output interface of video generation models, facilitating the generation of spatially consistent 3D content. Second, it should be accompanied by dense text annotations. This provides stronger supervision, enabling the 3D generator to maximally retain the video model's semantic understanding of text. Because the 3D generator inherits broad semantic knowledge from video data, this dense text supervision allows the generative model to effectively exploit that inherited capability.
Accordingly, our proposed dataset, Droplet3D-4M, comprises dense multi-view rendered videos and fine-grained, multi-view-level text annotations, as illustrated in Fig. 1. Each rendered video consists of an 85-frame image sequence captured from uniformly distributed 360° orbital viewpoints. The angular difference between adjacent frames is strictly controlled to be within 5°, ensuring video coherence, which is a critical factor in training video-driven generative models. In terms of text annotations, we provide dense descriptions with an average length of 260 words, far exceeding those in existing 3D datasets. More importantly, the annotations not only cover holistic appearances such as shape and style but also specifically describe appearance variations induced by changes in viewpoint. For example, in Fig. 1, the second paragraph of the text annotation details the side and rear features of the figurine, describing that the yellow backpack becomes partially visible from a side view and is only fully revealed from the back. This fine-grained, view-aware annotation paradigm provides unprecedented supervisory signals for the 3D generative model, effectively guiding and preserving the backbone network's capacity for complex semantic understanding.
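The orbital camera schedule above can be sketched as a quick check: 85 frames uniformly spanning 360° yield an adjacent-frame angular step of roughly 4.24°, which satisfies the dataset's 5° coherence bound (a minimal sketch; the function name is illustrative, not part of the released pipeline).

```python
def orbital_azimuths(num_frames: int = 85) -> list[float]:
    """Azimuth angles (degrees) for viewpoints uniformly spaced on a 360° orbit."""
    step = 360.0 / num_frames
    return [i * step for i in range(num_frames)]

azimuths = orbital_azimuths(85)
step = azimuths[1] - azimuths[0]      # ≈ 4.24° between adjacent frames
wrap_gap = 360.0 - azimuths[-1]      # gap closing the orbit back to frame 0
assert step < 5.0 and wrap_gap < 5.0  # matches the stated 5° coherence constraint
```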
For the construction of Droplet3D-4M, we propose a data pre-processing pipeline that balances quality and efficiency. This pipeline employs adaptive sampling techniques instead of resource-intensive rendering during the initial screening phase, followed by targeted high-fidelity rendering only for validated assets. Compared to conventional workflows, this approach reduces computational overhead by 4 to 7× while generating metadata-rich outputs of superior quality for multimodal learning tasks. The pipeline consists of three key parts: multi-view video rendering, image-evaluation-metric filtering, and multi-view-level caption generation, as shown in Fig. 2. Note that the raw 3D models we collected are from Objaverse-XL, comprising 6.3 million models sourced from GitHub and Sketchfab.
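The coarse-to-fine screening idea above (cheap probe views first, full rendering only for survivors) can be sketched as follows. This is a hedged illustration: `render_probe` and `score_image` are hypothetical stand-ins for the pipeline's actual renderer and image-evaluation metric, which are not specified here.

```python
def screen_assets(assets, render_probe, score_image,
                  probe_views=4, quality_threshold=0.5):
    """Initial screening phase: render only a few cheap probe views per asset
    and keep assets whose probes all pass the quality metric. Full 85-view
    high-fidelity rendering is then run only on the survivors."""
    passed = []
    for asset in assets:
        probes = render_probe(asset, probe_views)  # cheap, low-resolution
        if min(score_image(img) for img in probes) >= quality_threshold:
            passed.append(asset)
    return passed

# Toy stand-ins for the (hypothetical) probe renderer and quality metric.
fake_render = lambda asset, n: [asset["quality"]] * n
fake_score = lambda img: img
assets = [{"id": "a", "quality": 0.9}, {"id": "b", "quality": 0.2}]
kept = screen_assets(assets, fake_render, fake_score)  # keeps only asset "a"
```

Because probe rendering is far cheaper than full orbital rendering, filtering before the expensive pass is what yields the stated 4 to 7× overhead reduction.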
To transfer the spatial-consistency and semantic-knowledge priors that are readily accessible to video models into 3D generation algorithms, we designed and trained Droplet3D. To this end, we specifically employ the multi-view rendered videos from Droplet3D-4M to interface with the video backbone model. Furthermore, we use multi-view-level annotations to provide fine-grained supervision for model fine-tuning, thereby preserving the model's cross-modal semantic understanding capability. The above figure illustrates the architecture of the Droplet3D framework.


The technical details of the 3D generation process in Droplet3D are illustrated in the above figure. For any given text or image prompt, Droplet3D first aligns it with the model's expected input distribution. Initially, we expand the user's input using a lightweight large language model, rewriting it into a dense textual description that follows the same distribution as the multi-view-level captions in the Droplet3D-4M dataset. Simultaneously, to accommodate inputs from arbitrary viewpoints, we designed a canonical viewpoint alignment module to adjust the perspective of the input images. For the backbone network, we introduce a 3D causal VAE to achieve latent-space encoding and decoding of the video, and employ a multi-modal diffusion transformer to fuse text and video modality features while constraining their independence.
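The inference stages described above can be laid out schematically. Every helper below is a hypothetical stand-in, not the actual Droplet3D API: the real system would call the lightweight LLM rewriter, the canonical viewpoint alignment module, the 3D causal VAE, and the multi-modal diffusion transformer at these points.

```python
def expand_prompt(user_prompt: str) -> str:
    """Stand-in for the lightweight-LLM rewriter that expands the user's input
    into a dense, multi-view-style caption matching Droplet3D-4M's distribution."""
    return user_prompt + " [expanded into a dense multi-view caption]"

def align_viewpoint(image):
    """Stand-in for the canonical viewpoint alignment module, which adjusts
    an arbitrary-viewpoint input image to the canonical perspective."""
    return image

def generate_multiview(dense_text, canonical_image, num_frames=85):
    """Stand-in for the backbone: 3D causal VAE encode -> multi-modal
    diffusion transformer denoising -> VAE decode to an orbital video."""
    return [f"frame_{i}" for i in range(num_frames)]

# End-to-end flow: prompt expansion + viewpoint alignment, then generation.
frames = generate_multiview(expand_prompt("a toy robot"),
                            align_viewpoint("input.png"))
assert len(frames) == 85  # one dense orbital sequence, as in Droplet3D-4M
```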
Current mainstream 3D generation models typically only support image or text input. However, Droplet3D conditions on both an initial image and dense text, enabling it to support more targeted creative design.
Example 1
Example 2
Example 3
Example 4
Example 5
Example 6
Example 7
Example 8
Example 9
Example 10
Example 11
Droplet3D exhibits a relatively robust capability in lifting 2D content into 3D. Although Droplet3D-4M is built entirely from renderings of Objaverse-XL, which contains only object-level 3D assets, Droplet3D still demonstrates a notable degree of robustness, even when the input images differ significantly from the training distribution, such as stylized AIGC images, comics, or sketches. We believe this capability may stem from its prior video training, which endowed it with extensive general knowledge and makes its 3D generation more versatile. This observation is particularly interesting, as it validates, to some extent, our hypothesis that video can facilitate 3D generation.
Additionally, what excites us most is that, as presented in the following figure, our model can lift scene-level images into 3D, which distinguishes our technical approach from other currently popular native 3D generation methods, which are typically restricted to the object level. The following figure sequentially demonstrates scene-level 3D generation for a manor, an island with lightning, a tranquil riverside at night, and the interior of a space station. Similar to a cinematic freeze-frame effect, our model can automatically lift these scenes into 3D, thereby reducing labor costs; from this perspective, it can also be regarded as another form of 3D asset creation. What is even more intriguing is that the training set, Droplet3D-4M, contains no scene-level samples. This capability can therefore be considered entirely inherited from its base model, the DropletVideo video generation model.
To validate the accuracy of content generated by Droplet3D from a single image and text prompt, we compared it with other TI-to-3D methods, including LGM, Hunyuan3D-2, MVControl, and TRELLIS.
Example 1
This character is an anthropomorphic onion figure. The main body is bright yellow, with a round cap on top. The body is segmented, with a smooth and plump texture, and features vertical stripes. The character has a wide, open-mouthed smile, white eyes with black pupils, and the facial features are concentrated at the front. It has short, stubby arms and legs, and wears brown round-toed boots. The overall proportions are exaggerated and the style is consistent, exhibiting distinct characteristics of a cartoon toy.
The video begins with a front view of the onion character. The camera then slowly moves around to the side, revealing the round outline of the cap and the parallel contour lines of the onion body’s texture. As the camera gradually shifts to the back, the surface texture of the onion is shown to be symmetrical and continuous, and the edge of the cap is even and smooth. The camera continues to move to the other side, eventually returning to the front view, where the round cap on top and the shiny boots below visually echo each other, further highlighting the character’s cute, cartoonish style. After a full rotation, the video ends, leaving a deep impression of the character’s overall roundness and bright colors.
Droplet3D
LGM
MVControl
TRELLIS
Hunyuan3D-2
Example 2
The image presents a fantastical scene rich in Eastern ambience. The main subject is a traditional multi-story pagoda, perched atop a steep rocky island peak. The pagoda has a complex structure, dominated by dark red and brown tones. Its roofs curve upwards and ascend in tiers, crowned with a golden spire, which gives it a majestic and mysterious appearance. Mist and clouds swirl around, creating a visual illusion of floating in mid-air, making the pagoda seem as if it is suspended above the sky. The island is covered with vibrant red-leaved trees, which complement the red sky in the background, creating an atmosphere that is both warm and solemn.
As the camera circles around the pagoda, the various angles of the structure and changes in the environment can be observed. The front view displays the pagoda's complete edges and eaves, with strong colors making it stand out even more against the red background. From the side, the layered structure of the pagoda is evident. Circling to the back, the suspended pillars between the pagoda's sections appear like paths leading to unsolved mysteries, and the red trees are even denser from this angle. Finally, the camera returns to its initial position, once again revealing the full view of the pagoda and the stunning visual impact brought by the red sky.
Droplet3D
LGM
MVControl
TRELLIS
Hunyuan3D-2
Example 3
The cute girl is dressed in a red mech suit, carrying a heavy red mechanical backpack. She stands upright with her chest out and head held high. The mech suit is mainly a vibrant red, while her hair is blue, styled into a bun with short hair neatly parted down the middle at the back of her head. Her torso is equipped with red armor, and the joints near the red armor on her shoulders, chest, and knees are connected with silvery metallic parts. On her back, there is a heavy red mech backpack that gleams with a metallic sheen.
The audience first sees the robot from the front, with her facial expression clearly visible. Then, as the view rotates to the side, the girl can be seen standing straight with her head held high, her head rounded and her nose slightly upturned. The joints connecting her arm and leg armor are silver, framed by red protective gear, and the center of the shoulder armor features a silver circle. As the rotation continues, the back armor can be observed, revealing the complete red back armor. Finally, the view continues to rotate until a full circle is completed.
Droplet3D
LGM
MVControl
TRELLIS
Hunyuan3D-2
Example 4
The image depicts a cute little girl in a minimalist style, standing upright and giving off a sense of warmth and tranquility. She has smooth, light blonde hair that falls just below her shoulders in soft waves. A pink flower adorns her hair, adding a touch of gentleness and innocence. The girl has delicate features and a sweet expression. She is wearing a loose, light beige long-sleeve sweater, with a slightly longer hem for a casual and natural look. She holds a brown book in both hands, with simple letters on the cover. She is wearing light-colored split-toe cotton socks and no shoes.
The viewpoint starts from the front and then rotates; throughout the process, the girl's posture and smile remain unchanged. As the camera moves to the side, the softness and natural flow of her hair become more apparent, showing its gentle waves and length just past her shoulders. Rotating to the back, the length of her hair at the back matches what is seen from the side. When the camera returns to the front, the letters on the cover of the book in the girl's hands are clearly visible. The entire video presents a multi-angle close-up of this lovely girl, always maintaining her pure and warm aura.
Droplet3D
LGM
MVControl
TRELLIS
Hunyuan3D-2
Example 5
A cartoon panda astronaut with a round face and classic black and white color scheme, wearing a finely crafted white spacesuit with a blue control panel and wiring on the chest, red badge decorations on the shoulders, black gloves, and white boots, continuing the classic color scheme. The overall design is casual and friendly, showcasing a space exploration image.
The video begins with an eye-level perspective, first showing the panda astronaut's front view, from which the smile and the front details of the spacesuit can be observed. As the video continues to rotate, the side view is revealed, making the panda's round ears and the spacesuit's backpack structure more prominent. As the panda continues to turn on the screen, the back view gradually appears, showing the equipment and vest design on the back. Finally, the panda completes a 360-degree rotation, giving the audience a full-body view before the video ends. Each angle showcases a different charm of the panda astronaut, making the presentation delightful and full of imagination.
Droplet3D
LGM
MVControl
TRELLIS
Hunyuan3D-2
Example 6
A multi-layered tower house in a magical fairy tale style, with a colorful mottled tile roof, vines entwined around the exterior walls, a chimney and hanging lanterns on the tower top, warm yellow light emanating from the windows, and a cobblestone path leading to the door, with colorful mushrooms and green plants scattered around.
The camera maintains a steady height while smoothly performing a 360-degree horizontal rotation around the fairytale house. It begins by showcasing the main architectural features of the front facade. As the camera moves to the side, numerous windows come into view. When it reaches the back of the house, a stone door entrance is revealed. The camera continues rotating to the other side. Throughout the entire rotation, the structure remains stationary, and the lighting stays consistent.
Droplet3D
LGM
MVControl
TRELLIS
Hunyuan3D-2
Example 7
The magical double-bladed axe combines practicality and decoration. Its T-shaped structure features sharp dual arc blades at the top, with a deep gray metallic surface that has fine textures and geometric patterns. The central sturdy support exudes a regal aura, while the deep wood-colored axe handle wrapped in red and black textured rope enhances grip. The extended bottom provides a balanced feel, and the exquisite design makes it an art collectible.
The video is filmed from a straight-on perspective, starting from the side of the axe. As the axe rotates, viewers gradually see the top of the axe and the detailed textures of the blade. As the rotation continues, the red wrapping decoration on the handle becomes fully visible, showcasing the grip details of the axe. When the rotation reaches the front of the axe, the intricate geometric patterns on the central metal part can be seen—details that are not visible from other angles. Finally, the video completes a full 360-degree rotation of the axe, revealing its elegant yet deadly design.
Droplet3D
LGM
MVControl
TRELLIS
Hunyuan3D-2
@article{li2025droplet3d,
title={Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation},
author={Li, Xiaochuan and Du, Guoguang and Zhang, Runze and Jin, Liang and Jia, Qi and Lu, Lihua and Guo, Zhenhua and Zhao, Yaqian and Liu, Haiyang and Wang, Tianqi and Li, Changsheng and Gong, Xiaoli and Li, Rengang and Fan, Baoyu},
journal={arXiv preprint arXiv:2508.20470},
year={2025}
}