Speaker disentanglement in video-to-speech conversion

This web page presents results for our EUSIPCO 2021 submission:

Dan Oneață, Adriana Stan, Horia Cucu. Speaker disentanglement in video-to-speech conversion.
We show qualitative results for two sets of experiments: video-to-speech (Section 4.1 in our paper) and speaker control (Section 4.2).

Our code is available here.

Note: If you are having trouble playing the videos below, please consider using the Chrome or Firefox browsers.

Video-to-speech

We show results for the seen scenario, in which we consider videos from four speakers encountered at train time. We randomly selected 12 video samples and show the synthesized audio for our baseline method (denoted by B in the paper) and for the method of Vougioukas et al. (Interspeech, 2019), denoted by V2S GAN. The videos are cropped around the lips, corresponding to the input of our network. These results correspond to Section 4.1 in our paper.
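As a rough illustration of what "cropped around the lips" means, here is a minimal sketch of extracting a mouth crop from a video frame using dlib's 68-point face landmarks (points 48-67 cover the mouth). This is only an example, not the preprocessing used in our pipeline; the landmark model path, crop margin, and output size are assumptions made for the sketch.

```python
# Illustrative only: crop a square region around the mouth using dlib landmarks.
# Model path, margin, and crop size are assumptions, not the paper's settings.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_lips(frame, size=64, margin=12):
    """Return a square crop around the mouth, or None if no face is found."""
    rects = detector(frame, 1)
    if not rects:
        return None
    shape = predictor(frame, rects[0])
    # Landmarks 48-67 correspond to the mouth region in the 68-point model.
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    x0, y0 = pts.min(axis=0) - margin
    x1, y1 = pts.max(axis=0) + margin
    crop = frame[max(y0, 0):y1, max(x0, 0):x1]
    return cv2.resize(crop, (size, size))
```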

s1 pwaj9a
place white at j nine again
s1 lwwf7s
lay white with f seven soon
s2 bgan6a
bin green at n six again
s2 lbid2s
lay blue in d two soon
s4 lgwtza
lay green with t zero again
s4 brwa2a
bin red with a two again
s4 swih6a
set white in h six again
s4 bbal3n
bin blue at l three now
s4 sbat4s
set blue at t four soon
s29 sbid7s
set blue in d seven soon
s29 pwwh9a
place white with h nine again
s29 brax3s
bin red at x three soon

Speaker control

In this experiment, we synthesize audio from two inputs: (i) the video stream showing the lip movement and (ii) a target speaker identity. For each test video sample, we synthesize audio in all the target voices encountered at train time. Depending on whether the identity shown in the video was seen at train time, we distinguish two scenarios: seen and unseen. You can select the desired target identity using the drop-down menu beneath each video.

We also provide results for two of our methods, which incorporate the speaker information in different forms: either the speaker identity (SI) or the speaker embedding (SE). You can select the method using the corresponding drop-down menu. Note that the SE method cannot control the speaker as well as SI does, but it still translates reasonably well across genders (when going from male to female speakers, or vice versa).
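To make the difference between the two conditioning schemes concrete, here is a minimal sketch, not the actual architecture from the paper: SI looks up a learned embedding from the target speaker's index, while SE uses a precomputed speaker-embedding vector (for example, from a speaker-verification model); in both cases the speaker representation is fused with the video features before decoding. All module names, feature sizes, and the fusion scheme below are illustrative assumptions.

```python
# Minimal sketch of the two conditioning variants (SI vs. SE). This is NOT the
# architecture from the paper: module names, sizes, and the fusion of the
# speaker representation with the video features are assumptions only.
import torch
import torch.nn as nn

class SpeakerConditionedV2S(nn.Module):
    def __init__(self, num_speakers=14, video_dim=512, spk_dim=256, mode="SI"):
        super().__init__()
        self.mode = mode
        # Stand-in for a video encoder over the lip crops (in practice, a
        # spatio-temporal network would produce the per-frame features).
        self.video_encoder = nn.GRU(video_dim, 512, batch_first=True)
        if mode == "SI":
            # SI: learned embedding table indexed by the target speaker's id.
            self.speaker_table = nn.Embedding(num_speakers, spk_dim)
        # SE: a precomputed speaker-embedding vector is passed in at forward
        # time (e.g. extracted by a pretrained speaker-verification model).
        self.decoder = nn.Linear(512 + spk_dim, 80)  # per-frame mel-spectrogram

    def forward(self, video_feats, speaker_id=None, speaker_emb=None):
        h, _ = self.video_encoder(video_feats)           # (B, T, 512)
        if self.mode == "SI":
            s = self.speaker_table(speaker_id)           # (B, spk_dim)
        else:
            s = speaker_emb                              # (B, spk_dim)
        s = s.unsqueeze(1).expand(-1, h.size(1), -1)     # broadcast over time
        return self.decoder(torch.cat([h, s], dim=-1))   # (B, T, 80)
```

In this sketch, speaker control at test time amounts to keeping the input video fixed and sweeping the target identity, which is what the drop-down menus in the demos below expose: with SI, iterate over the 14 speaker indices seen at train time; with SE, swap in different speaker-embedding vectors.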

Seen scenario

In this scenario, the input videos at test time come from identities also encountered at train time (14 identities), but neither the video samples nor their word sequences were seen during training. Due to space limitations, we could not include quantitative results for this setting in our paper.

s1 bgwh8n
bin green with h eight now
s3 bgahzp
bin green at h zero please
s5 bbar7s
bin blue at r seven soon
s6 bril5p
bin red in l five please
s7 lgbz5s
lay green by z five soon
s10 bbbv4n
bin blue by v four now
s12 bbay6a
bin blue at y six again
s14 lgae1n
lay green at e one now
s15 bbaq6n
bin blue at q six now
s17 brwszp
bin red with s zero please
s22 brad7n
bin red at d seven now
s26 bgbz4a
bin green by z four again
s28 bgar2s
bin green at r two soon
s32 bbwj4s
bin blue with j four soon

Unseen scenario

In this scenario, the identities of the people in the input videos at test time (9 identities) differ from those encountered at train time (14 identities). We still synthesize speech in all the target voices encountered at train time (14 voices). These results correspond to Section 4.2 in our paper.

s2 bbbf5n
bin blue by f five now
s4 bbal5p
bin blue at l five please
s11 bbar1s
bin blue at r one soon
s13 bbak6p
bin blue at k six please
s18 bbak1p
bin blue at k one please
s19 bbaj8n
bin blue at j eight now
s25 bbac8n
bin blue at c eight now
s31 bbav6p
bin blue at v six please
s33 bbbi9s
bin blue by i nine soon