Generating customized content in videos has received increasing attention recently. However, existing works primarily focus on
customized text-to-video generation for a single subject and suffer from subject-missing and attribute-binding problems when the
video is expected to contain multiple subjects. Furthermore, existing models struggle to assign the desired actions to the corresponding
subjects (the action-binding problem), failing to achieve satisfactory multi-subject generation performance. To tackle these problems, in this
paper we propose DisenStudio, a novel framework that generates text-guided videos for multiple customized subjects, given
only a few images of each subject. Specifically, DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed
spatial-disentangled cross-attention mechanism to associate each subject with the desired action. Then the model is customized for the
multiple subjects with the proposed motion-preserved disentangled finetuning, which involves three tuning strategies: multi-subject co-occurrence
tuning, masked single-subject tuning, and multi-subject motion-preserved tuning. The first two strategies guarantee that each subject occurs in the video and
preserve its visual attributes, and the third helps the model maintain its temporal motion-generation ability when finetuned on static
images. We conduct extensive experiments to demonstrate that DisenStudio significantly outperforms existing methods on various metrics.
Additionally, we show that DisenStudio can be used as a powerful tool for various controllable generation applications.
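One way to read the spatial-disentangled cross-attention described above is that each subject's text tokens are only allowed to influence that subject's spatial region of the video latent, so each subject stays bound to its own appearance and action. The following PyTorch sketch illustrates this reading under stated assumptions; it is not the authors' implementation, and the function name, the per-subject key/value lists, and the region_masks argument are hypothetical names introduced purely for illustration.

```python
import torch

def spatially_disentangled_cross_attention(q, text_keys, text_values, region_masks):
    """Illustrative sketch (assumption, not the paper's code) of region-masked cross-attention.

    q:            (B, N, d)   queries from the video latent tokens (N spatial positions)
    text_keys:    list of S tensors, each (B, L_s, d), keys from subject s's prompt tokens
    text_values:  list of S tensors, each (B, L_s, d), values from subject s's prompt tokens
    region_masks: list of S tensors, each (B, N), 1 where subject s occupies space, else 0
    Returns:      (B, N, d)   features where each region is driven only by its own subject's prompt
    """
    d = q.shape[-1]
    out = torch.zeros_like(q)
    for k, v, m in zip(text_keys, text_values, region_masks):
        # standard scaled dot-product cross-attention against this subject's tokens
        attn = torch.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)  # (B, N, L_s)
        # restrict the result to this subject's spatial region
        out = out + m.unsqueeze(-1) * (attn @ v)
    return out
```

In such a scheme, the spatial disentanglement comes entirely from the region masks: positions outside a subject's mask receive no contribution from that subject's prompt, which is one plausible way to avoid the attribute- and action-binding failures mentioned above.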