This web page presents results for our EUSIPCO 2021 submission:

Dan Oneață, Adriana Stan, Horia Cucu. Speaker disentanglement in video-to-speech conversion.

We show qualitative results for two sets of experiments: a comparison with prior work (section 4.1 in the paper) and speaker control experiments (section 4.2).
Our code is available here.
Note: If you are having trouble playing the videos below, please consider using the Chrome or Firefox browsers.

We show results for the seen scenario, in which we consider videos from four speakers encountered at training. We have randomly selected 12 video samples and show the synthesized audio for our baseline method (denoted by B in the paper) and for the work of Vougioukas et al. (Interspeech, 2019) (denoted by V2S GAN). The videos are cropped around the lips, corresponding to the input to our network. These results correspond to section 4.1 in our paper.
pwaj9a
lwwf7s
bgan6a
lbid2s
lgwtza
brwa2a
swih6a
bbal3n
sbat4s
sbid7s
pwwh9a
brax3s
In this experiment, we synthesize audio based on two inputs: (i) the video stream showing the lip movements and (ii) a target identity. For each test video sample, we synthesize audio in all target voices encountered at train time. Depending on whether the identity shown in the video was encountered at train time, we distinguish two scenarios: seen and unseen. You can select the desired target identity using the drop-down menu beneath each video.
We also provide results for two of our methods, which incorporate the speaker information in two forms: either the speaker identity (SI) or a speaker embedding (SE). You can select the method using the corresponding drop-down menu. Note that the SE method is not able to control the speaker as well as SI, but it can still translate reasonably well across genders (when going from male to female speakers, or vice versa).
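To make the distinction between the two conditioning schemes concrete, the sketch below shows one way a target-speaker vector can be injected into a video-to-speech decoder. This is a minimal, hypothetical PyTorch example, not our implementation: the module names, dimensions, and the concatenation strategy are assumptions made for illustration only.

```python
# Hypothetical sketch of speaker conditioning (not the code used in the paper).
# SI: look up a learned vector for one of the training speakers.
# SE: pass in an embedding computed by an external speaker encoder instead.
import torch
import torch.nn as nn

class SpeakerConditioner(nn.Module):
    def __init__(self, num_speakers=14, spk_dim=64, video_dim=256):
        super().__init__()
        # One learned vector per target voice (SI conditioning).
        self.id_table = nn.Embedding(num_speakers, spk_dim)
        # Project the concatenated features back to the decoder dimension.
        self.proj = nn.Linear(video_dim + spk_dim, video_dim)

    def forward(self, video_feats, speaker_id=None, speaker_emb=None):
        # video_feats: (batch, time, video_dim) features from the lip encoder.
        if speaker_emb is None:
            speaker_emb = self.id_table(speaker_id)  # (batch, spk_dim)
        # Broadcast the speaker vector over time and attach it to every frame.
        spk = speaker_emb.unsqueeze(1).expand(-1, video_feats.size(1), -1)
        return self.proj(torch.cat([video_feats, spk], dim=-1))
```

In this view, SI amounts to selecting a row of the identity table, while SE would replace that lookup with an embedding computed from reference audio of the target speaker.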
In this scenario, the input videos at test time have identities also encountered at train time (14 identities), but neither the video samples nor the word sequences were seen during training. Due to space limitations, we were not able to present quantitative results for this setting in our paper.
bgwh8n
bgahzp
bbar7s
bril5p
lgbz5s
bbbv4n
bbay6a
lgae1n
bbaq6n
brwszp
brad7n
bgbz4a
bgar2s
bbwj4s
In this scenario, the identities of the people in the input videos at test time (9 identities) are different from those encountered at train time (14 identities). We still synthesize speech in all target voices encountered at train time (14 voices). These results correspond to section 4.2 in our paper.
bbbf5n
bbal5p
bbar1s
bbak6p
bbak1p
bbaj8n
bbac8n
bbav6p
bbbi9s