Generating customized content in videos has received increasing attention recently. However, existing works primarily focus on
customized text-to-video generation for a single subject and suffer from subject-missing and attribute-binding problems when the
video is expected to contain multiple subjects. Furthermore, existing models struggle to assign the desired actions to the corresponding
subjects (the action-binding problem), failing to achieve satisfactory multi-subject generation performance. To tackle these problems, in this
paper we propose DisenStudio, a novel framework that generates text-guided videos for multiple customized subjects, given
only a few images of each subject. Specifically, DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed
spatial-disentangled cross-attention mechanism to associate each subject with the desired action. Then the model is customized for the
multiple subjects with the proposed motion-preserved disentangled finetuning, which involves three tuning strategies: multi-subject co-occurrence
tuning, masked single-subject tuning, and multi-subject motion-preserved tuning. The first two strategies guarantee that each subject occurs in the video and
preserve its visual attributes, and the third helps the model maintain its temporal motion-generation ability when finetuned on static
images. We conduct extensive experiments to demonstrate that DisenStudio significantly outperforms existing methods on various metrics.
Additionally, we show that DisenStudio can be used as a powerful tool for various controllable generation applications.
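One way to read the spatial-disentangled cross-attention described above is that each subject's text tokens are only allowed to influence that subject's spatial region of the video latent, so each subject stays bound to its own appearance and action. The following PyTorch sketch illustrates this reading under stated assumptions; it is not the authors' implementation, and the function name, the per-subject key/value lists, and the region_masks argument are hypothetical names introduced purely for illustration.

```python
import torch

def spatially_disentangled_cross_attention(q, text_keys, text_values, region_masks):
    """Illustrative sketch (assumption, not the paper's code) of region-masked cross-attention.

    q:            (B, N, d)   queries from the video latent tokens (N spatial positions)
    text_keys:    list of S tensors, each (B, L_s, d), keys from subject s's prompt tokens
    text_values:  list of S tensors, each (B, L_s, d), values from subject s's prompt tokens
    region_masks: list of S tensors, each (B, N), 1 where subject s occupies space, else 0
    Returns:      (B, N, d)   features where each region is driven only by its own subject's prompt
    """
    d = q.shape[-1]
    out = torch.zeros_like(q)
    for k, v, m in zip(text_keys, text_values, region_masks):
        # standard scaled dot-product cross-attention against this subject's tokens
        attn = torch.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)  # (B, N, L_s)
        # restrict the result to this subject's spatial region
        out = out + m.unsqueeze(-1) * (attn @ v)
    return out
```

In such a scheme, the spatial disentanglement comes entirely from the region masks: positions outside a subject's mask receive no contribution from that subject's prompt, which is one plausible way to avoid the attribute- and action-binding failures mentioned above.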