I2VGen-XL: High-Quality Image-to-Video Synthesis
via Cascaded Diffusion Models
Video synthesis has recently made remarkable strides, benefiting from the rapid development of diffusion models. However, it still faces challenges in semantic accuracy, clarity, and spatio-temporal continuity. These challenges arise primarily from the scarcity of well-aligned text-video data and the complex inherent structure of videos, which make it difficult for a model to ensure semantic and qualitative excellence simultaneously. In this report, we propose I2VGen-XL, a cascaded approach that improves model performance by decoupling these two factors and ensures the alignment of the input data by using static images as a crucial form of guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280x720. To improve diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. In this way, I2VGen-XL simultaneously improves the semantic accuracy, continuity of details, and clarity of generated videos. Through extensive experiments, we investigate the underlying principles of I2VGen-XL and compare it with current top methods, demonstrating its effectiveness on diverse data. The source code and models will be publicly available here.
In the living room, there is a gold-red banana shaped sofa with throw pillows.
A 3D render of a coffee mug placed on a window sill during a stormy day. The storm outside the window is reflected...
A minimap diorama of a cafe adorned with indoor plants. Wooden beams crisscross above, and a cold brew...
Close-up photograph of a hermit crab nestled in wet sand, with sea foam nearby and the details of its shell and texture of the sand accentuated.
In a fantastical setting, a highly detailed furry humanoid skunk with piercing eyes confidently poses in a medium shot, wearing an animal hide...
An illustration of a human heart made of translucent glass, standing on a pedestal amidst a stormy sea. Rays of sunlight pierce the clouds...
A paper craft art depicting a girl giving her cat a gentle hug. Both sit amidst potted plants, with the cat purring contentedly while the girl...
Tiny potato kings wearing majestic crowns, sitting on thrones, overseeing their vast potato kingdom filled with...
A photo of an ancient shipwreck nestled on the ocean floor. Marine plants have claimed the wooden structure...
A moon on the sea.
Colorful underwater world, 3D cartoon.
Colorful underwater world, 3D cartoon.
A slim girl dancing.
A cute kitten in the grass, 3D cartoon.
A kitten in flowers, Chinese painting.
An opportunity is waiting for you.
If you are seeking an exhilarating challenge and the chance to work on AIGC and large-scale pretraining, you have come to the right place. We are searching for talented, motivated, and imaginative researchers to join our team. If you are interested, please send us your resume by email at zhangjin.zsw@alibaba-inc.com.
References
@article{2023i2vgenxl,
title={I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models},
author={Zhang, Shiwei* and Wang, Jiayu* and Zhang, Yingya* and Zhao, Kang and Yuan, Hangjie and Qing, Zhiwu and Wang, Xiang and Zhao, Deli and Zhou, Jingren},
journal={arXiv preprint arXiv:2311.04145},
year={2023}
}
@article{2023videocomposer,
title={VideoComposer: Compositional Video Synthesis with Motion Controllability},
author={Wang, Xiang* and Yuan, Hangjie* and Zhang, Shiwei* and Chen, Dayou* and Wang, Jiuniu and Zhang, Yingya and Shen, Yujun and Zhao, Deli and Zhou, Jingren},
journal={arXiv preprint arXiv:2306.02018},
year={2023}
}