Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation

1IEIT, 2Nankai University, 3Tsinghua University
*Equal Contribution, Corresponding Authors

Abstract


Scaling laws have validated the success and promise of large-data-trained models in creative generation across text, image, and video domains. However, this paradigm faces data scarcity in the 3D domain, as far less 3D data is available on the internet than data for the aforementioned modalities. Fortunately, there exist adequate videos that inherently contain commonsense priors, offering an alternative supervisory signal to mitigate the generalization bottleneck caused by limited native 3D data. On the one hand, videos capturing multiple views of an object or scene provide a spatial consistency prior for 3D generation. On the other hand, the rich semantic information contained within the videos enables the generated content to be more faithful to the text prompts and semantically plausible. This paper explores how to apply the video modality in 3D asset generation, spanning datasets to models. We introduce Droplet3D-4M, the first large-scale video dataset with multi-view-level annotations, and train Droplet3D, a generative model supporting both image and dense text input. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to produce spatially consistent and semantically plausible content. Moreover, in contrast to the prevailing 3D solutions, our approach exhibits the potential for extension to scene-level applications. This indicates that the commonsense priors from the videos significantly facilitate 3D creation.

The Droplet3D-4M Dataset

Fig. 1 A sample from Droplet3D-4M comprises an 85-frame multi-view rendered video and a fine-grained, multi-view-level text annotation. Different colors in the text indicate new details revealed by camera angle changes; the same details are marked with matching colors on the left.

To take advantage of the commonsense priors acquired by video generation models trained on large-scale video data and thereby enhance the generalization capability of 3D generative models, we construct a novel dataset, Droplet3D-4M, to bridge the gap between the video and 3D domains. Specifically, a qualifying 3D sample must satisfy two criteria. First, it should support dense, coherent multi-view sequences. This ensures compatibility with the output interface of video generation models, facilitating the generation of spatially consistent 3D content. Second, it should be accompanied by dense text annotations. This provides stronger supervision, enabling the 3D generator to maximally retain the video model’s semantic understanding of text. As the 3D generator inherits broader semantic knowledge from video data, this dense text supervision allows the generative model to effectively exploit this inherited capability.
Accordingly, our proposed dataset, Droplet3D-4M, comprises dense multi-view rendered videos and fine-grained, multi-view-level text annotations, as illustrated in Fig. 1. Each rendered video consists of an 85-frame image sequence captured from uniformly distributed 360° orbital viewpoints. The angular difference between adjacent frames is strictly controlled to be within 5°, ensuring video coherence, which is a critical factor in training video-driven generative models. In terms of text annotations, we provide dense descriptions with an average length of 260 words, far exceeding those in existing 3D datasets. More importantly, the annotations not only cover holistic appearances such as shape and style but also specifically describe appearance variations induced by changes in viewpoint. For example, in Fig. 1, the second paragraph of the text annotation details the side and rear features of the figurine, describing that the yellow backpack becomes partially visible from a side view and is only fully revealed from the back. This fine-grained, view-aware annotation paradigm provides unprecedented supervisory signals for the 3D generative model, effectively guiding and preserving the backbone network’s capacity for complex semantic understanding.
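The frame count and angular bound above can be checked with a short calculation. The following is a minimal sketch, not the authors' rendering code: it assumes the 85 viewpoints are spaced evenly over a closed orbit (so the last frame is one step short of 360°), and the orbit radius and elevation are placeholder values that do not come from the text.

```python
import math

NUM_FRAMES = 85      # frames per rendered video (stated in the text)
MAX_STEP_DEG = 5.0   # bound on the angle between adjacent frames (stated in the text)


def orbit_azimuths(num_frames=NUM_FRAMES):
    """Azimuth angles in degrees, evenly spaced over a closed 360-degree orbit.

    Assumption: the orbit is closed, so the step is 360 / num_frames.
    """
    step = 360.0 / num_frames
    return [i * step for i in range(num_frames)]


def camera_position(azimuth_deg, radius=2.0, elevation_deg=0.0):
    """Camera position on a sphere around the origin.

    radius and elevation_deg are hypothetical defaults, not values from the paper.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = radius * math.cos(el) * math.cos(az)
    y = radius * math.cos(el) * math.sin(az)
    z = radius * math.sin(el)
    return (x, y, z)


azimuths = orbit_azimuths()
step_deg = azimuths[1] - azimuths[0]           # roughly 4.24 degrees
wrap_step_deg = 360.0 - azimuths[-1]           # gap between last and first frame
positions = [camera_position(a) for a in azimuths]
```

Under this assumption, 85 evenly spaced views give a step of 360/85 degrees, which stays within the 5° bound, including the wrap-around gap between the last and first frame.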


Fig. 2 The pipeline we proposed to curate the Droplet3D-4M dataset.

To construct Droplet3D-4M, we propose a data pre-processing pipeline that balances quality and efficiency. This pipeline employs adaptive sampling techniques instead of resource-intensive rendering during the initial screening phase, followed by targeted high-fidelity rendering only for validated assets. Compared to conventional workflows, this approach reduces computational overhead by a factor of 4 to 7 while generating metadata-rich outputs of superior quality for multimodal learning tasks. The pipeline consists of three key parts: multi-view video rendering, image evaluation metric filtering, and multi-view-level caption generation, as shown in Fig. 2. Note that the raw 3D models we collected are from Objaverse-XL, comprising 6.3 million models sourced from GitHub and Sketchfab.
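The screen-then-render idea can be sketched as follows. This is an illustrative outline only: the `Asset` type, the `preview_score` field, the thresholds, and the three callables are hypothetical stand-ins for the adaptive sampling, rendering, image-metric, and captioning components, whose actual implementations are not described here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Asset:
    uid: str
    preview_score: float  # hypothetical score from a cheap, sparsely sampled preview


def curate(
    assets: List[Asset],
    screen_threshold: float,
    render_video: Callable[[Asset], List[str]],
    frame_quality: Callable[[List[str]], float],
    quality_threshold: float,
    caption_video: Callable[[List[str]], str],
) -> List[Dict[str, object]]:
    """Cheap screening first, then rendering, metric filtering, and captioning."""
    samples = []
    for asset in assets:
        # Initial screening: skip weak assets before paying for a full render.
        if asset.preview_score < screen_threshold:
            continue
        # High-fidelity multi-view rendering only for assets that passed screening.
        frames = render_video(asset)
        # Image evaluation metric filtering on the rendered frames.
        if frame_quality(frames) < quality_threshold:
            continue
        # Multi-view-level caption generation for the surviving videos.
        samples.append(
            {"uid": asset.uid, "frames": frames, "caption": caption_video(frames)}
        )
    return samples


# Toy usage with placeholder components.
render_calls = []


def fake_render(asset):
    render_calls.append(asset.uid)
    return [f"{asset.uid}_frame_{i:02d}.png" for i in range(85)]


def fake_quality(frames):
    # Pretend that asset "b" renders poorly.
    return 0.0 if frames[0].startswith("b") else 1.0


def fake_caption(frames):
    return f"placeholder caption for {len(frames)} frames"


assets = [Asset("a", 0.9), Asset("b", 0.8), Asset("c", 0.1)]
dataset = curate(assets, 0.5, fake_render, fake_quality, 0.5, fake_caption)
```

In the toy run, asset "c" is rejected during screening and never rendered, which is where the saving in this kind of design would come from.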

The Droplet3D Model

Droplet3D Framework

To transfer the spatial consistency and semantic knowledge priors that are readily accessible to video models into 3D generation, we designed and trained Droplet3D. To this end, we employ multi-view rendered videos from Droplet3D-4M to interface with the video backbone model. Furthermore, we use multi-view-level annotations to provide fine-grained supervision for model fine-tuning, thereby preserving the model’s cross-modal semantic understanding capability. The above figure illustrates the architecture of the Droplet3D framework.
Overview of the Droplet3D Framework

The technical details of the 3D generation process in Droplet3D are illustrated in the above figure. For any given text or image prompt, Droplet3D first aligns it with the model. We expand the user’s input with a lightweight large language model, rewriting it into a dense textual description that follows the same distribution as the multi-view-level captions in the Droplet3D-4M dataset. Simultaneously, to accommodate inputs from arbitrary viewpoints, we design a canonical viewpoint alignment module to adjust the perspective of the input image. For the backbone network, we introduce a 3D causal VAE to achieve implicit space encoding and decoding of the video, and employ a multi-modal diffusion transformer to facilitate the fusion of text and video modality features while constraining their independence.
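The text does not detail the internals of the 3D causal VAE. As a rough illustration of what "causal" commonly means for video VAEs, the sketch below shows a temporal convolution in which each output frame depends only on the current and earlier frames. It is a generic toy on one scalar per frame, with zero padding on the past side as an assumed choice, and is not the Droplet3D implementation.

```python
def causal_conv1d(frames, kernel):
    """Temporal convolution where output t depends only on inputs up to t.

    frames: list of floats, one value per time step (a stand-in for a video frame).
    kernel: list of floats; the last weight applies to the current frame.
    Padding with zeros on the past side only is an assumption of this sketch.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(frames))
    ]


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.25, 0.25, 0.5]
out = causal_conv1d(signal, kernel)

# Changing a later frame must leave all earlier outputs untouched.
modified = signal[:3] + [100.0]
out_modified = causal_conv1d(modified, kernel)
```

Because padding is applied only on the past side, editing the last frame changes only the last output, which is the property that lets a first frame (for example, the conditioning image) be encoded without reference to later frames.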

Feasible Controllable-Creativity Driven by Language Prompts

Current mainstream 3D generation models typically support only image or text input. In contrast, Droplet3D conditions on both an initial image and dense text, enabling more targeted creative design. This is reflected in the variations of the generated assets when we pair the same AIGC image with different textual descriptions.
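The examples that follow share a pattern: a fixed appearance description paired with a different camera narrative for each variant. A minimal sketch of composing such prompts is given below; the helper name `build_dense_prompt` is hypothetical, the strings are abbreviated from Example 1, and how the resulting text is passed to the model is not specified here.

```python
def build_dense_prompt(appearance, view_narrative):
    """Join a shared appearance description with a view-by-view camera narrative."""
    return f"{appearance.strip()}\n{view_narrative.strip()}"


# Shared first paragraph (abbreviated from Example 1).
appearance = "This character is an anthropomorphic onion figure."

# Per-variant camera narratives (abbreviated from Example 1).
variants = {
    "wind_up_key": "As the camera rotates to the back, a wind-up key is revealed at the center of the back.",
    "crystal_orb": "As the camera moves to the back, a blue glowing crystal orb is embedded in the center of the back.",
}

prompts = {
    name: build_dense_prompt(appearance, narrative)
    for name, narrative in variants.items()
}
```

Keeping the appearance paragraph fixed while varying only the narrative isolates the effect of the text on the unseen views, since the front view is already determined by the input image.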

Example 1
Prompt: This character is an anthropomorphic onion figure. The main body is bright yellow, with a round cap on top. The body is segmented, with a smooth and plump texture, and features vertical stripes. The character has a wide, open-mouthed smile, white eyes with black pupils, and the facial features are concentrated at the front. It has short, stubby arms and legs, and wears brown round-toed boots. The overall proportions are exaggerated and the style is consistent, exhibiting distinct characteristics of a cartoon toy.
The video begins with a front view of the onion character. The camera then slowly moves around to the side, revealing the round outline of the cap and the parallel contour lines of the onion body’s texture. As the camera gradually shifts to the back, the surface texture of the onion is shown to be symmetrical and continuous, and the edge of the cap is even and smooth. The camera continues to move to the other side, eventually returning to the front view, where the round cap on top and the shiny boots below visually echo each other, further highlighting the character’s cute, cartoonish style. After a full rotation, the video ends, leaving a deep impression of the character’s overall roundness and bright colors.
Prompt: This character is an anthropomorphic onion figure. The main body is bright yellow, with a round cap on top. The body is segmented, with a smooth and plump texture, and features vertical stripes. The character has a wide, open-mouthed smile, white eyes with black pupils, and the facial features are concentrated at the front. It has short, stubby arms and legs, and wears brown round-toed boots. The overall proportions are exaggerated and the style is consistent, exhibiting distinct characteristics of a cartoon toy.
The camera starts from the front, where the character’s wide open-mouthed smile and bright yellow vertical stripes are clearly visible, and the round cap on top appears smooth and full. As the camera rotates to the back, a beautiful and delicate wind-up key is revealed at the center of the back. The key is green, with four metal gears connected at the end of its arms, and the fine engravings on the gears are clearly defined. The side view shows the thickness of the wind-up key and the sense of rotation of the gears. Finally, the camera returns to the front, highlighting the strong mechanical feel.
Prompt: This character is an anthropomorphic onion figure. The main body is bright yellow, with a round cap on top. The body is segmented, with a smooth and plump texture, and features vertical stripes. The character has a wide, open-mouthed smile, white eyes with black pupils, and the facial features are concentrated at the front. It has short, stubby arms and legs, and wears brown round-toed boots. The overall proportions are exaggerated and the style is consistent, exhibiting distinct characteristics of a cartoon toy.
The camera starts from the front, gradually revealing the character's round cap and the smooth, segmented texture of its body, while the exaggerated and lively smile is clearly shown. As the camera moves to the back, a blue glowing crystal orb is embedded in the center of the back, with light and shadows flowing inside the sphere. From the side, the three-dimensionality of the crystal is displayed. Finally, the camera returns to the front, presenting a unified and complete overall style.
Prompt: This character is an anthropomorphic onion figure. The main body is bright yellow, with a round cap on top. The body is segmented, with a smooth and plump texture, and features vertical stripes. The character has a wide, open-mouthed smile, white eyes with black pupils, and the facial features are concentrated at the front. It has short, stubby arms and legs, and wears brown round-toed boots. The overall proportions are exaggerated and the style is consistent, exhibiting distinct characteristics of a cartoon toy.
The camera starts from the front and slowly rotates to the right, allowing the viewer to observe the curvature of the cap and the changes in body proportions from a side angle. As the camera moves to the back, a prominent black-and-white QR code sticker is revealed on the character’s back, occupying more than two-thirds of the rear surface. The QR code is a common square shape, with a light gray border to ensure clear visibility against the yellow background. The sticker is slightly raised, fitting well with the curve of the back, and serves as an independent visual focal point. The camera continues rotating to the left, with the corners of the QR code still clearly visible from the side, before smoothly returning to the front view.
Example 2
Prompt: The image presents a fantastical scene rich in Eastern ambience. The main subject is a traditional multi-story pagoda, perched atop a steep rocky island peak. The pagoda has a complex structure, dominated by dark red and brown tones. Its roofs curve upwards and ascend in tiers, crowned with a golden spire, which gives it a majestic and mysterious appearance. Mist and clouds swirl around, creating a visual illusion of floating in mid-air, making the pagoda seem as if it is suspended above the sky. The island is covered with vibrant red-leaved trees, which complement the red sky in the background, creating an atmosphere that is both warm and solemn.
As the camera circles around the pagoda, the various angles of the structure and changes in the environment can be observed. The front view displays the pagoda's complete edges and eaves, with strong colors making it stand out even more against the red background. From the side, the layered structure of the pagoda is evident. Circling to the back, the suspended pillars between the pagoda's sections appear like paths leading to unsolved mysteries, and the red trees are even denser from this angle. Finally, the camera returns to its initial position, once again revealing the full view of the pagoda and the stunning visual impact brought by the red sky.
Prompt: The image presents a fantastical scene rich in Eastern ambience. The main subject is a traditional multi-story pagoda, perched atop a steep rocky island peak. The pagoda has a complex structure, dominated by dark red and brown tones. Its roofs curve upwards and ascend in tiers, crowned with a golden spire, which gives it a majestic and mysterious appearance. Mist and clouds swirl around, creating a visual illusion of floating in mid-air, making the pagoda seem as if it is suspended above the sky. The island is covered with vibrant red-leaved trees, which complement the red sky in the background, creating an atmosphere that is both warm and solemn.
As the camera circles around the pagoda, the various angles of the structure and changes in its surroundings can be observed. The front view displays the full edges and eaves of the pagoda; from the side, the layered structure of the pagoda becomes apparent, and the enclosing wall along the edge of the cliff behind the pagoda gradually comes into view. Circling to the back, the suspended pillars between sections of the pagoda appear like paths leading to unsolved mysteries, and the front of the city wall at the cliff's edge can be seen. Finally, the camera returns to its original position, once again revealing the full view of the pagoda and the stunning visual impact brought by the red sky.
Prompt: The image presents a fantastical scene rich in Eastern ambience. The main subject is a traditional multi-story pagoda, perched atop a steep rocky island peak. The pagoda has a complex structure, dominated by dark red and brown tones. Its roofs curve upwards and ascend in tiers, crowned with a golden spire, which gives it a majestic and mysterious appearance. Mist and clouds swirl around, creating a visual illusion of floating in mid-air, making the pagoda seem as if it is suspended above the sky. The island is covered with vibrant red-leaved trees, which complement the red sky in the background, creating an atmosphere that is both warm and solemn.
As the camera circles around the pagoda, various angles of the structure and changes in its surroundings can be observed. The front view displays the complete edges and eaves of the pagoda, with the strong color tones making it stand out even more against the red background. From the side, the layered structure of the pagoda becomes apparent. Circling to the back, the suspended pillars between the sections of the pagoda appear like paths leading to unsolved mysteries, and the red trees are especially dense on the backside. Finally, the camera returns to its original position, once again revealing the full view of the pagoda and the stunning visual impact of the red sky.
Prompt: The image presents a fantastical scene rich in Eastern ambience. The main subject is a traditional multi-story pagoda, perched atop a steep rocky island peak. The pagoda has a complex structure, dominated by dark red and brown tones. Its roofs curve upwards and ascend in tiers, crowned with a golden spire, which gives it a majestic and mysterious appearance. Mist and clouds swirl around, creating a visual illusion of floating in mid-air, making the pagoda seem as if it is suspended above the sky. The island is covered with vibrant red-leaved trees, which complement the red sky in the background, creating an atmosphere that is both warm and solemn.
As the camera circles around the pagoda, various angles of the structure and changes in its surroundings can be observed. The front view displays the complete edges and eaves of the pagoda, with the bold color tones making it stand out even more against the red background. From the side, the layered structure of the pagoda becomes apparent, and the pavilion behind the pagoda gradually comes into view. Circling to the back, the suspended pillars between the sections of the pagoda look like paths leading to unsolved mysteries, and the front of the small pavilion can be seen. Finally, the camera returns to its original position, once again revealing the full view of the pagoda and the stunning visual impact of the red sky.
Example 3
Prompt: The cute girl is dressed in a red mech suit, carrying a heavy red mechanical backpack. She stands upright with her chest out and head held high. The mech suit is mainly a vibrant red, while her hair is blue, styled into a bun with short hair neatly parted down the middle at the back of her head. Her torso is equipped with red armor, and the joints near the red armor on her shoulders, chest, and knees are connected with silvery metallic parts. On her back, there is a heavy red mech backpack that gleams with a metallic sheen.
The audience first sees the robot from the front, with her facial expression clearly visible. Then, as the view rotates to the side, the girl can be seen standing straight with her head held high, her head rounded and her nose slightly upturned. The joints connecting her arm and leg armor are silver, framed by red protective gear, and the center of the shoulder armor features a silver circle. As the rotation continues, the back armor can be observed, revealing the complete red back armor. Finally, the view continues to rotate until a full circle is completed.
Prompt: The cute girl is dressed in a red mech suit, carrying a heavy red mechanical backpack. She stands upright with her chest out and head held high. The mech suit is mainly a vibrant red, while her hair is blue, styled into a bun with short hair neatly parted down the middle at the back of her head. Her torso is equipped with red armor, and the joints near the red armor on her shoulders, chest, and knees are connected with silvery metallic parts. On her back, there is a heavy red mech backpack that gleams with a metallic sheen.
The audience first sees the robot from the front, with her facial expression clearly visible. Then, as the view rotates to the side, the girl can be seen standing straight with her head held high, her head rounded and her nose slightly upturned. From the side, the heavy red metal backpack is visible. As the rotation continues, the back armor can be observed, with the backpack centered on her back. Finally, the view continues to rotate until a full circle is completed.
Prompt: The cute girl is dressed in a red mech suit, carrying a heavy red mechanical backpack. She stands upright with her chest out and head held high. The mech suit is mainly a vibrant red, while her hair is blue, styled into a bun with short hair neatly parted down the middle at the back of her head. Her torso is equipped with red armor, and the joints near the red armor on her shoulders, chest, and knees are connected with silvery metallic parts. On her back, there is a heavy red mech backpack that gleams with a metallic sheen.
The audience first sees the robot from the front, with her facial expression clearly visible. Then, as the view rotates to the side, the girl can be seen standing upright with her head held high, her head rounded and her nose slightly upturned. From the side, a miniature control screen can be seen closely attached to her back. As the rotation continues, the back armor comes into view, with the miniature control screen positioned in the center. Finally, the view continues to rotate until a full circle is completed.
Prompt: The cute girl is dressed in a red mech suit, carrying a heavy red mechanical backpack. She stands upright with her chest out and head held high. The mech suit is mainly a vibrant red, while her hair is blue, styled into a bun with short hair neatly parted down the middle at the back of her head. Her torso is equipped with red armor, and the joints near the red armor on her shoulders, chest, and knees are connected with silvery metallic parts. On her back, there is a heavy red mech backpack that gleams with a metallic sheen.
The audience first sees the robot from the front, with her facial expression clearly visible. Then, as the view rotates to the side, the girl can be seen standing upright with her head held high, her head rounded and her nose slightly upturned. From the side, a red booster can be seen closely attached to her back. As the rotation continues, the back armor comes into view, with a solar power panel positioned in the center. Finally, the view continues to rotate until a full circle is completed.
Prompt: The cute girl is dressed in a red mech suit, carrying a heavy red mechanical backpack. She stands upright with her chest out and head held high. The mech suit is mainly a vibrant red, while her hair is blue, styled into a bun with short hair neatly parted down the middle at the back of her head. Her torso is equipped with red armor, and the joints near the red armor on her shoulders, chest, and knees are connected with silvery metallic parts. On her back, there is a heavy red mech backpack that gleams with a metallic sheen.
The audience first sees the robot from the front, with her facial expression clearly visible. Then, as the view rotates to the side, the girl can be seen standing upright with her head held high, her head rounded and her nose slightly upturned. From the side, a blue braid can be seen hanging down. As the rotation continues, the back armor comes into view, with the blue braid positioned in the center. Finally, the view continues to rotate until a full circle is completed.
Prompt: The cute girl is dressed in a red mech suit, carrying a heavy red mechanical backpack. She stands upright with her chest out and head held high. The mech suit is mainly a vibrant red, while her hair is blue, styled into a bun with short hair neatly parted down the middle at the back of her head. Her torso is equipped with red armor, and the joints near the red armor on her shoulders, chest, and knees are connected with silvery metallic parts. On her back, there is a heavy red mech backpack that gleams with a metallic sheen.
The audience first sees the robot from the front, with her facial expression clearly visible. Then, as the view rotates to the side, the girl can be seen standing upright with her head held high, her head rounded and her nose slightly upturned. From the side, delicate fairy wings can be seen. As the rotation continues, the back armor comes into view. Finally, the view continues to rotate until a full circle is completed.
Example 4
Prompt: The image depicts a cute little girl in a minimalist style, standing upright and giving off a sense of warmth and tranquility. She has smooth, light blonde hair that falls just below her shoulders in soft waves. A pink flower adorns her hair, adding a touch of gentleness and innocence. The girl has delicate features and a sweet expression. She is wearing a loose, light beige long-sleeve sweater, with a slightly longer hem for a casual and natural look. She holds a brown book in both hands, with simple letters on the cover. She is wearing light-colored split-toe cotton socks and no shoes.
The viewpoint starts from the front and then rotates; throughout the process, the girl's posture and smile remain unchanged. As the camera moves to the side, the softness and natural flow of her hair become more apparent, showing its gentle waves and length just past her shoulders. Rotating to the back, the length of her hair at the back matches what is seen from the side. When the camera returns to the front, the letters on the cover of the book in the girl's hands are clearly visible. The entire video presents a multi-angle close-up of this lovely girl, always maintaining her pure and warm aura.
Prompt: The image depicts a cute little girl in a minimalist style, standing upright and giving off a sense of warmth and tranquility. She has smooth, light blonde hair that falls just below her shoulders in soft waves. A pink flower adorns her hair, adding a touch of gentleness and innocence. The girl has delicate features and a sweet expression. She is wearing a loose, light beige long-sleeve sweater, with a slightly longer hem for a casual and natural look. She holds a brown book in both hands, with simple letters on the cover. She is wearing light-colored split-toe cotton socks and no shoes.
The viewpoint starts from the front and rotates; throughout the process, the girl’s posture and smile remain unchanged. As the camera moves to the side, the softness and natural flow of her hair become more apparent, showing gentle waves, with the hair at the back reaching down to her waist. Rotating to the back, the length of her hair matches what was seen from the side. When the camera returns to the front, the letters on the cover of the book in the girl's hands are clearly visible. The entire video presents a multi-angle close-up of this adorable girl, always maintaining her pure and warm aura.
Prompt: The image depicts a cute little girl in a minimalist style, standing upright and giving off a sense of warmth and tranquility. She has smooth, light blonde hair that falls just below her shoulders in soft waves. A pink flower adorns her hair, adding a touch of gentleness and innocence. The girl has delicate features and a sweet expression. She is wearing a loose, light beige long-sleeve sweater, with a slightly longer hem for a casual and natural look. She holds a brown book in both hands, with simple letters on the cover. She is wearing light-colored split-toe cotton socks and no shoes.
The viewpoint starts from the front and rotates; throughout the process, the girl’s posture and smile remain unchanged. As the camera moves to the side, the softness and natural flow of her hair become more apparent, with gentle waves, and the hair at the back just reaching past her shoulders, with the hair at the back of her head tucked into her clothes. Rotating to the back, the length of the girl’s hair matches what is seen from the side. When the camera returns to the front, the letters on the cover of the book in the girl's hands are clearly visible. The entire video presents multi-angle close-ups of this adorable girl, always maintaining her pure and warm aura.
Example 5
Prompt: A cartoon panda astronaut with a round face and classic black and white color scheme, wearing a finely crafted white spacesuit with a blue control panel and wiring on the chest, red badge decorations on the shoulders, black gloves, and white boots, continuing the classic color scheme. The overall design is casual and friendly, showcasing a space exploration image.
The video begins with an eye-level perspective, first showing the panda astronaut's front view, from which the smile and the front details of the spacesuit can be observed. As the video continues to rotate, the side view is revealed, making the panda’s round ears and the spacesuit’s backpack structure more prominent. As the panda continues to turn on the screen, the back view gradually appears, showing the equipment and vest design on the back. Finally, the panda completes a 360-degree rotation, giving the audience a full-body view before the video ends. Each angle showcases a different charm of the panda astronaut, making the presentation delightful and full of imagination.
Prompt: A cartoon panda astronaut with a round face and classic black and white color scheme, wearing a finely crafted white spacesuit with a blue control panel and wiring on the chest, red badge decorations on the shoulders, black gloves, and white boots, continuing the classic color scheme. The overall design is casual and friendly, showcasing a space exploration image.
The video begins with an eye-level perspective, first showing the panda astronaut's front view, from which the smile and the front details of the spacesuit can be observed. As the video continues to rotate, the audience can see the side view, making the panda's round ears and the outline of the spacesuit more prominent. As the panda rotates on the screen, the back gradually comes into view, revealing an orange square experimental backpack equipped with various scientific instruments, antennas, and sample collection containers. Finally, the panda completes a 360-degree rotation, allowing the audience to see the complete image of the research astronaut. The video ends at this point.
Prompt: A cartoon panda astronaut with a round face and classic black and white color scheme, wearing a finely crafted white spacesuit with a blue control panel and wiring on the chest, red badge decorations on the shoulders, black gloves, and white boots, continuing the classic color scheme. The overall design is casual and friendly, showcasing a space exploration image.
The video starts with an eye-level perspective, first showing the panda astronaut's front view, from which the smile and the front details of the spacesuit can be observed. As the video continues to rotate, the audience can see the side view, making the panda's round ears and the outline of the spacesuit more prominent. As the panda rotates on the screen, the back gradually appears, revealing a transparent spherical backpack containing an energy core emitting rainbow-colored light. Finally, the panda completes a 360-degree rotation, presenting a highly sci-fi futuristic astronaut image, and the video ends at this point.
Example 6
Prompt: A multi-layered tower house in a magical fairy tale style, with a colorful mottled tile roof, vines entwined around the exterior walls, a chimney and hanging lanterns on the tower top, warm yellow light emanating from the windows, and a cobblestone path leading to the door, with colorful mushrooms and green plants scattered around.
The camera maintains a steady height while smoothly performing a 360-degree horizontal rotation around the fairytale house. It begins by showcasing the main architectural features of the front facade. As the camera moves to the side, numerous windows come into view. When it reaches the back of the house, a stone door entrance is revealed. The camera continues rotating to the other side. Throughout the entire rotation, the structure remains stationary, and the lighting stays consistent.
Prompt: A multi-layered tower house in a magical fairy tale style, with a colorful mottled tile roof, vines entwined around the exterior walls, a chimney and hanging lanterns on the tower top, warm yellow light emanating from the windows, and a cobblestone path leading to the door, with colorful mushrooms and green plants scattered around.
The camera maintains a steady height while smoothly performing a 360-degree horizontal rotation around the fairytale house. It begins by showcasing the main architectural features and entrance details at the front. As the camera turns to the side, red doors and windows come into view. Continuing to the back of the house, a functional brown wooden door is visible. As the rotation moves to the other side, a garden enclosed by white railings appears. Throughout the entire shoot, the building remains completely static.
Example 7
Prompt: The magical double-bladed axe combines practicality and decoration. Its T-shaped structure features sharp dual arc blades at the top, with a deep gray metallic surface that has fine textures and geometric patterns. The central sturdy support exudes a regal aura, while the deep wood-colored axe handle wrapped in red and black textured rope enhances grip. The extended bottom provides a balanced feel, and the exquisite design makes it an art collectible.
The video is filmed from a straight-on perspective, starting from the side of the axe. As the axe rotates, viewers gradually see the top of the axe and the detailed textures of the blade. As the rotation continues, the red wrapping decoration on the handle becomes fully visible, showcasing the grip details of the axe. When the rotation reaches the front of the axe, the intricate geometric patterns on the central metal part can be seen—details that are not visible from other angles. Finally, the video completes a full 360-degree rotation of the axe, revealing its elegant yet deadly design.
Prompt: The magical double-bladed axe combines practicality and decoration. Its T-shaped structure features sharp dual arc blades at the top, with a deep gray metallic surface that has fine textures and geometric patterns. The central sturdy support exudes a regal aura, while the deep wood-colored axe handle wrapped in red and black textured rope enhances grip. The extended bottom provides a balanced feel, and the exquisite design makes it an art collectible.
The video is filmed at eye level, with the camera maintaining a consistent height while performing a 360-degree horizontal rotation. It begins from the side, showcasing the overall design of the axe and details of the blade. As the camera moves to the back, the axe blade is revealed to be covered in intricate dragon scale patterns, with the edges of the scales displaying a metallic sheen and a red metal texture. As the rotation continues, the handle and the red dragon scale decorations create a visually rich layered effect. Throughout the entire shoot, the axe remains completely still, and the details of the dragon scale pattern are clearly visible.
Prompt: The magical double-bladed axe combines practicality and decoration. Its T-shaped structure features sharp dual arc blades at the top, with a deep gray metallic surface that has fine textures and geometric patterns. The central sturdy support exudes a regal aura, while the deep wood-colored axe handle wrapped in red and black textured rope enhances grip. The extended bottom provides a balanced feel, and the exquisite design makes it an art collectible.
The video is shot from a level perspective, with the camera maintaining a stable height for a 360-degree horizontal rotation. Initially, the camera shows the axe's basic shape and blade features from the side. When the camera moves to the back, a jade-green life gem floats between the two blades. Inside the gem, there is flowing light, surrounded by four golden feathers that remain in a static, slightly swaying posture in the breeze. As the rotation continues, the handle and red wrapping decoration naturally blend with tradition. From the front, the central geometric pattern and the mysterious glow of the life gem complement each other. Throughout the entire rotation, everything remains still, clearly displaying each leaf texture and the gem's transparent quality.
Prompt: The magical double-bladed axe combines practicality and decoration. Its T-shaped structure features sharp dual arc blades at the top, with a deep gray metallic surface that has fine textures and geometric patterns. The central sturdy support exudes a regal aura, while the deep wood-colored axe handle wrapped in red and black textured rope enhances grip. The extended bottom provides a balanced feel, and the exquisite design makes it an art collectible.
The video is filmed at eye level, with the camera maintaining a steady height while performing a 360-degree horizontal rotation. It begins by showing the silhouette and blade textures of the axe from the side. As the rotation progresses and the camera moves to the back, a glowing purple crystal is revealed at the center between the two blades. The crystal is hexagonal in shape, with mysterious rune-like patterns faintly visible inside, and it is held in place by finely crafted metallic claw-shaped brackets. The entire filming remains static.
Example 8
Prompt: A cartoon monkey is dressed in a pink dress covered with white polka dots, radiating a sense of innocence and cuteness. The dress fits snugly and falls naturally, with the hem reaching the knees and gently swaying with movement, full of liveliness. The monkey has a rounded and well-balanced body, with large, round ears and soft pink inner ear flaps that complement the style of the outfit. The hat matches the dress’s pattern and is topped with a fluffy pom-pom, adding layers and a playful vibe.
The camera starts from the front, showcasing the monkey’s details and the intricate pattern design of the outfit. As the camera slowly rotates to the side, the side profile of the hat and the three-dimensional look of the ears become more apparent, and you can also see the back hem of the outfit gently swaying as the monkey moves. Rotating to the back reveals that the back hem is slightly wider than the front, ensuring smooth movement while walking. Finally, the camera steadily returns to the front, with the whole scene filled with a sense of fun and imagination.
Prompt: A cartoon monkey is dressed in a pink dress covered with white polka dots, radiating a sense of innocence and cuteness. The dress fits snugly and falls naturally, with the hem reaching the knees and gently swaying with movement, full of liveliness. The monkey has a rounded and well-balanced body, with large, round ears and soft pink inner ear flaps that complement the style of the outfit. The hat matches the dress’s pattern and is topped with a fluffy pom-pom, adding layers and a playful vibe.
The camera captures the monkey’s lively expression and the polka dot details of the dress from the front. Then, as it shifts to the side, a pair of pink butterfly wings can be seen unfolding from the monkey’s back, with the base fitting closely and naturally to the body, appearing delicate and refined. The wings have a soft gradient sheen, shimmering under the light. When the camera moves to the back, the wings occupy the entire back area, with white polka dots along the edges that echo the pattern of the dress. The tail pokes out from the center of the hem, curving slightly upward and adding a playful touch. Finally, the camera smoothly returns to the front, completing the scene with a lighthearted and cheerful atmosphere.
Prompt: A cartoon monkey is dressed in a pink dress covered with white polka dots, radiating a sense of innocence and cuteness. The dress fits snugly and falls naturally, with the hem reaching the knees and gently swaying with movement, full of liveliness. The monkey has a rounded and well-balanced body, with large, round ears and soft pink inner ear flaps that complement the style of the outfit. The hat matches the dress’s pattern and is topped with a fluffy pom-pom, adding layers and a playful vibe.
The camera captures the monkey’s lively expression and the polka dot details of the dress from the front. As it shifts to the side, the side profile of the hat and the three-dimensional quality of the ears become more apparent, appearing natural and harmonious. When the camera moves to the back, a blue mini backpack can be seen on the monkey’s shoulders, adorned with white polka dots matching the dress, giving it a delicate and cute look. The backpack’s straps fit snugly against the body. Its rounded shape echoes the monkey’s plump proportions. Finally, the camera smoothly returns to the front, with the backpack’s design perfectly integrated into the overall look, adding extra playfulness and energy to the scene.
Prompt: A cartoon monkey is dressed in a pink dress covered with white polka dots, radiating a sense of innocence and cuteness. The dress fits snugly and falls naturally, with the hem reaching the knees and gently swaying with movement, full of liveliness. The monkey has a rounded and well-balanced body, with large, round ears and soft pink inner ear flaps that complement the style of the outfit. The hat matches the dress’s pattern and is topped with a fluffy pom-pom, adding layers and a playful vibe.
The camera starts from the front and slides to the side, gradually revealing the star patterns along the edge of the purple cape. The cape’s soft texture contrasts with the drape of the skirt. As the camera moves to the back, the cape is fully spread out, covering the shoulders and the entire back area. The details of the yellow star patterns and tassels are clearly visible, and the hem of the cape gently rests on the skirt, creating a layered overall design. Finally, the camera smoothly returns to the front, with the cape’s design perfectly integrated into the overall look, adding more playfulness and energy to the scene.
Example 9
Prompt: This video showcases a cartoon-style plush toy model with a compact, cylindrical structure. The toy is covered in gradient rainbow fur, transitioning from pink, orange-yellow, green, blue, to purple from the top of the head to the feet. The ears are upright bunny ears, with white inner sides and the outer sides continuing the pink fur of the head. The hands and feet are rounded and ball-shaped, with pale yellow palms and soles. The face is cartoonish, featuring large eyes and a prominent smile, resulting in a cohesive overall appearance with vivid colors.
The video is filmed at eye level, first displaying the plush toy from the front, where the rainbow hues appear especially striking under the lighting, leaving a strong impression with their bold contrast. The toy slowly rotates, gradually revealing the side details, making the shape of the ears and the luster of the fur more pronounced. Afterward, the back of the toy is fully shown, with the colors matching the front and presenting an even gradient. The video concludes with the toy completing a 360-degree rotation, providing viewers with a comprehensive, all-around view of the toy.
Prompt: This video showcases a cartoon-style plush toy model with a compact, cylindrical structure. The toy is covered in gradient rainbow fur, transitioning from pink, orange-yellow, green, blue, to purple from the top of the head to the feet. The ears are upright bunny ears, with white inner sides and the outer sides continuing the pink fur of the head. The hands and feet are rounded and ball-shaped, with pale yellow palms and soles. The face is cartoonish, featuring large eyes and a prominent smile, resulting in a cohesive overall appearance with vivid colors.
The video is filmed at eye level, initially showing the front of the plush toy. Then, the camera slowly moves to the side, where the three-dimensional quality of the ear becomes more prominent. Next, the camera shifts to the back of the toy, where a pink bunny tail is visible near the lower back. The tail is rounded overall, with evenly distributed, soft fur, and its base aligns with the bottom edge of the toy. Finally, the camera completes a 360-degree rotation at a steady pace, fully showcasing the toy’s overall design details and beauty from every angle.
Prompt: This video showcases a cartoon-style plush toy model with a compact, cylindrical structure. The toy is covered in gradient rainbow fur, transitioning from pink, orange-yellow, green, blue, to purple from the top of the head to the feet. The ears are upright bunny ears, with white inner sides and the outer sides continuing the pink fur of the head. The hands and feet are rounded and ball-shaped, with pale yellow palms and soles. The face is cartoonish, featuring large eyes and a prominent smile, resulting in a cohesive overall appearance with vivid colors.
The video begins with a front view, where the rainbow-colored plush appears soft and shiny under the light. The camera slowly moves to the side, showcasing the three-dimensional quality of the toy's ears and the smooth gradient of its fur. As the camera continues to rotate to the back, the gradient fur on the back remains evenly distributed, and a large red bow stands out at the back of the head, clearly designed and centered. Finally, the camera completes a full rotation, thoroughly displaying the toy from every angle and ensuring that all details are clearly presented.
Prompt: This video showcases a cartoon-style plush toy model with a compact, cylindrical structure. The toy is covered in gradient rainbow fur, transitioning from pink, orange-yellow, green, blue, to purple from the top of the head to the feet. The ears are upright bunny ears, with white inner sides and the outer sides continuing the pink fur of the head. The hands and feet are rounded and ball-shaped, with pale yellow palms and soles. The face is cartoonish, featuring large eyes and a prominent smile, resulting in a cohesive overall appearance with vivid colors.
The video starts from the front at eye level, with the plush toy’s rainbow gradient colors appearing vivid and rich under the light. The camera slowly moves to the side, showcasing the soft texture of the fur. As the camera gradually shifts to the back, a cartoon backpack can be seen attached to the lower back, featuring a soft pastel yellow color. Finally, the camera completes a smooth rotation, highlighting the toy’s details and design features from all angles.
Example 10
Prompt: The character in the video presents a cartoon-style teenage adventurer, full of energy and a spirit of adventure. The character wears a light gray hat adorned with a yellow badge in the center, symbolizing professionalism and enthusiasm. They are dressed in a light blue shirt with rolled-up sleeves, paired with a yellow tie featuring black stripes, giving a simple and practical overall design. The dark brown shoulder strap contrasts sharply with the round pink tool at the waist, which is uniquely designed and highly functional. Deep blue trousers are paired with yellow work boots, creating a harmonious color scheme that showcases both a professional style and an adventurous trait.
As the camera circles around the character, the back reveals a large toolbox backpack that fits snugly against the back, featuring a hard-shell deep brown design with a yellow handle on top. The sides of the backpack are decorated with rivets and equipped with multiple small tool pockets, highlighting its functionality and practicality. As the camera sweeps over the lower back, it shows the bottom of the toolbox with a thickened design for added stability and durability, consistent with the overall style. Returning to the front, the pink tool and the backpack form a striking contrast, and the toolbox design further emphasizes the character’s adventurous and professional qualities.
Prompt: The character in the video presents a cartoon-style teenage adventurer, full of energy and a spirit of adventure. The character wears a light gray hat adorned with a yellow badge in the center, symbolizing professionalism and enthusiasm. They are dressed in a light blue shirt with rolled-up sleeves, paired with a yellow tie featuring black stripes, giving a simple and practical overall design. The dark brown shoulder strap contrasts sharply with the round pink tool at the waist, which is uniquely designed and highly functional. Deep blue trousers are paired with yellow work boots, creating a harmonious color scheme that showcases both a professional style and an adventurous trait.
As the camera circles around the character, the details of the explorer backpack on the character’s back are revealed. The backpack features a deep blue tone, accented with yellow zippers and pockets, and is connected on both sides by wide straps. Shifting to one side, a mesh pocket on the side of the backpack can be seen, equipped with an elastic band to adjust how securely items are held. On the other side, a vest loop is visible under the shirt hem, adding a practical element and contributing to the overall explorer design. Returning to the front, a small pink exploration tool, whose purpose is unclear, is possibly used for site measurement or positioning, further emphasizing the character’s pragmatic and adventurous theme.
Prompt: The character in the video presents a cartoon-style teenage adventurer, full of energy and a spirit of adventure. The character wears a light gray hat adorned with a yellow badge in the center, symbolizing professionalism and enthusiasm. They are dressed in a light blue shirt with rolled-up sleeves, paired with a yellow tie featuring black stripes, giving a simple and practical overall design. The dark brown shoulder strap contrasts sharply with the round pink tool at the waist, which is uniquely designed and highly functional. Deep blue trousers are paired with yellow work boots, creating a harmonious color scheme that showcases both a professional style and an adventurous trait.
As the camera circles around the character, the back reveals a large toolbox backpack that fits snugly against the back, featuring a hard-shell deep brown design with a yellow handle on top. The sides of the backpack are decorated with rivets and equipped with multiple small tool pockets, highlighting its functionality and practicality. As the camera sweeps over the lower back, it shows the bottom of the toolbox with a thickened design for added stability and durability, consistent with the overall style. Returning to the front, the pink tool and the backpack form a striking contrast, and the toolbox design further emphasizes the character’s adventurous and professional qualities.
Example 11
Prompt: A mysterious superhero character in the style of ancient Egyptian mythology, wearing a white full-body tight suit adorned with dark gold decorative elements. The upper body features two gold rings with hollow designs, and a loose cape drapes down to the ankles, creating a sense of dynamic beauty. The white mask leaves only the eyes exposed, which shimmer faintly. Both gloves and boots are embellished with gold stripes. The character wears a hood that merges seamlessly with the cape, giving the entire look a unified, mysterious, and powerful feel.
The camera maintains a steady height, slowly rotating 360 degrees horizontally around the character. It starts with a frontal view, allowing the audience to clearly see the character’s powerful aura and all the details of the chest decorations. As the camera turns, the character’s profile gradually appears, with the three-dimensional structure of the robe and the gold-decorated arms clearly visible. When the camera reaches the back, star patterns embroidered on the hood can be seen, with the stars arranged irregularly. The back of the cape also features the same star patterns. Throughout the shoot, the character remains completely still, and the textures and decorative details of the costume are clearly presented from every angle, making it suitable for precise 3D reconstruction needs.
Prompt: A mysterious superhero character in the style of ancient Egyptian mythology, wearing a white full-body tight suit adorned with dark gold decorative elements. The upper body features two gold rings with hollow designs, and a loose cape drapes down to the ankles, creating a sense of dynamic beauty. The white mask leaves only the eyes exposed, which shimmer faintly. Both gloves and boots are embellished with gold stripes. The character wears a hood that merges seamlessly with the cape, giving the entire look a unified, mysterious, and powerful feel.
The camera maintains a steady height, slowly rotating 360 degrees horizontally around the character. From the front view, the character’s majestic appearance and the details of the decorations on the front are displayed. As the perspective shifts, the texture of the robe and the gold decorative elements on the sides become clearly visible. When the camera moves to the back, a small blue crystal is embedded in the back of the hood. The back of the cape features a gradient effect that transitions smoothly from deep blue to light blue. The bottom edge of the cape is finely trimmed with gold. The character remains completely still with stable lighting, ensuring that all color and material variations are fully captured.

3D Lifting from Stylized Inputs

Droplet3D is fairly robust at lifting 2D content into 3D. Although its training set, Droplet3D-4M, consists entirely of renderings from Objaverse-XL, which contains only object-level 3D files, Droplet3D still handles input images that differ significantly from the training distribution, such as stylized AIGC images like comics or sketches. We believe this capability may stem from the model's prior video training, which has endowed it with extensive general knowledge and made its 3D generation more versatile. This observation is particularly interesting because it supports, to some extent, our hypothesis that video can facilitate 3D generation.

Comics paintings


Sketch paintings

Scene-level 3D Content Generation

What excites us most is that, as shown in the following figure, our model can also lift scene-level images into 3D. This sets our technical approach apart from currently popular native 3D generation methods, which are typically limited to the object level. The figure shows, in order, scene-level 3D generation for a manor, an island with lightning, a tranquil riverside at night, and a scene deep within a space station. Similar to a cinematic freeze-frame effect, our model lifts these scenes to 3D automatically, thereby reducing labor costs. From this perspective, it can also be regarded as another form of 3D asset creation. Even more intriguing, the training set, Droplet3D-4M, contains no scene-level samples. This capability can therefore be considered entirely inherited from its parent model, the DropletVideo video generation model.



Comparative Examples

To assess how accurately Droplet3D generates content from a single image and a text prompt, we compare it with other TI-to-3D methods, including LGM, Hunyuan3D-2, Qingying, MVControl, and TRELLIS.

Example 1

This character is an anthropomorphic onion figure. The main body is bright yellow, with a round cap on top. The body is segmented, with a smooth and plump texture, and features vertical stripes. The character has a wide, open-mouthed smile, white eyes with black pupils, and the facial features are concentrated at the front. It has short, stubby arms and legs, and wears brown round-toed boots. The overall proportions are exaggerated and the style is consistent, exhibiting distinct characteristics of a cartoon toy.
The video begins with a front view of the onion character. The camera then slowly moves around to the side, revealing the round outline of the cap and the parallel contour lines of the onion body’s texture. As the camera gradually shifts to the back, the surface texture of the onion is shown to be symmetrical and continuous, and the edge of the cap is even and smooth. The camera continues to move to the other side, eventually returning to the front view, where the round cap on top and the shiny boots below visually echo each other, further highlighting the character’s cute, cartoonish style. After a full rotation, the video ends, leaving a deep impression of the character’s overall roundness and bright colors.

Example 2

The image presents a fantastical scene rich in Eastern ambience. The main subject is a traditional multi-story pagoda, perched atop a steep rocky island peak. The pagoda has a complex structure, dominated by dark red and brown tones. Its roofs curve upwards and ascend in tiers, crowned with a golden spire, which gives it a majestic and mysterious appearance. Mist and clouds swirl around, creating a visual illusion of floating in mid-air, making the pagoda seem as if it is suspended above the sky. The island is covered with vibrant red-leaved trees, which complement the red sky in the background, creating an atmosphere that is both warm and solemn.
As the camera circles around the pagoda, the various angles of the structure and changes in the environment can be observed. The front view displays the pagoda's complete edges and eaves, with strong colors making it stand out even more against the red background. From the side, the layered structure of the pagoda is evident. Circling to the back, the suspended pillars between the pagoda's sections appear like paths leading to unsolved mysteries, and the red trees are even denser from this angle. Finally, the camera returns to its initial position, once again revealing the full view of the pagoda and the stunning visual impact brought by the red sky.

Example 3

The cute girl is dressed in a red mech suit, carrying a heavy red mechanical backpack. She stands upright with her chest out and head held high. The mech suit is mainly a vibrant red, while her hair is blue, styled into a bun with short hair neatly parted down the middle at the back of her head. Her torso is equipped with red armor, and the joints near the red armor on her shoulders, chest, and knees are connected with silvery metallic parts. On her back, there is a heavy red mech backpack that gleams with a metallic sheen.
The audience first sees the robot from the front, with her facial expression clearly visible. Then, as the view rotates to the side, the girl can be seen standing straight with her head held high, her head rounded and her nose slightly upturned. The joints connecting her arm and leg armor are silver, framed by red protective gear, and the center of the shoulder armor features a silver circle. As the rotation continues, the back armor can be observed, revealing the complete red back armor. Finally, the view continues to rotate until a full circle is completed.

Example 4

The image depicts a cute little girl in a minimalist style, standing upright and giving off a sense of warmth and tranquility. She has smooth, light blonde hair that falls just below her shoulders in soft waves. A pink flower adorns her hair, adding a touch of gentleness and innocence. The girl has delicate features and a sweet expression. She is wearing a loose, light beige long-sleeve sweater, with a slightly longer hem for a casual and natural look. She holds a brown book in both hands, with simple letters on the cover. She is wearing light-colored split-toe cotton socks and no shoes.
The viewpoint starts from the front and then rotates; throughout the process, the girl's posture and smile remain unchanged. As the camera moves to the side, the softness and natural flow of her hair become more apparent, showing its gentle waves and length just past her shoulders. Rotating to the back, the length of her hair at the back matches what is seen from the side. When the camera returns to the front, the letters on the cover of the book in the girl's hands are clearly visible. The entire video presents a multi-angle close-up of this lovely girl, always maintaining her pure and warm aura.

Example 5

A cartoon panda astronaut with a round face and classic black and white color scheme, wearing a finely crafted white spacesuit with a blue control panel and wiring on the chest, red badge decorations on the shoulders, black gloves, and white boots, continuing the classic color scheme. The overall design is casual and friendly, showcasing a space exploration image.
The video begins with a eye-level perspective, first showing the panda astronaut's front view, from which the smile and the front details of the spacesuit can be observed. As the video continues to rotate, the side view is revealed, making the panda’s round ears and the spacesuit’s backpack structure more prominent. As the panda continues to turn on the screen, the back view gradually appears, showing the equipment and vest design on the back. Finally, the panda completes a 360-degree rotation, giving the audience a full-body view before the video ends. Each angle showcases a different charm of the panda astronaut, making the presentation delightful and full of imagination.

Example 6

A multi-layered tower house in a magical fairy tale style, with a colorful mottled tile roof, vines entwined around the exterior walls, a chimney and hanging lanterns on the tower top, warm yellow light emanating from the windows, and a cobblestone path leading to the door, with colorful mushrooms and green plants scattered around.
The camera maintains a steady height while smoothly performing a 360-degree horizontal rotation around the fairytale house. It begins by showcasing the main architectural features of the front facade. As the camera moves to the side, numerous windows come into view. When it reaches the back of the house, a stone door entrance is revealed. The camera continues rotating to the other side. Throughout the entire rotation, the structure remains stationary, and the lighting stays consistent.

Example 7

The magical double-bladed axe combines practicality and decoration. Its T-shaped structure features sharp dual arc blades at the top, with a deep gray metallic surface that has fine textures and geometric patterns. The central sturdy support exudes a regal aura, while the deep wood-colored axe handle wrapped in red and black textured rope enhances grip. The extended bottom provides a balanced feel, and the exquisite design makes it an art collectible.
The video is filmed from a straight-on perspective, starting from the side of the axe. As the axe rotates, viewers gradually see the top of the axe and the detailed textures of the blade. As the rotation continues, the red wrapping decoration on the handle becomes fully visible, showcasing the grip details of the axe. When the rotation reaches the front of the axe, the intricate geometric patterns on the central metal part can be seen—details that are not visible from other angles. Finally, the video completes a full 360-degree rotation of the axe, revealing its elegant yet deadly design.

BibTeX Citation

      
@article{li2025droplet3d,
  title={Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation},
  author={Li, Xiaochuan and Du, Guoguang and Zhang, Runze and Jin, Liang and Jia, Qi and Lu, Lihua and Guo, Zhenhua and Zhao, Yaqian and Liu, Haiyang and Wang, Tianqi and Li, Changsheng and Gong, Xiaoli and Li, Rengang and Fan, Baoyu},
  journal={arXiv preprint arXiv:2508.20470},
  year={2025}
}