Speaker disentanglement in video-to-speech conversion

This web page presents results for our EUSIPCO 2021 submission:

Dan Oneață, Adriana Stan, Horia Cucu. Speaker disentanglement in video-to-speech conversion.
We show qualitative results for two sets of experiments: video-to-speech (Section 4.1 in our paper) and speaker control (Section 4.2).

Our code is available here.

Note: If you are having trouble playing the videos below, please consider using the Chrome or Firefox browsers.

Video-to-speech

We show results for the seen scenario, in which we consider videos from four speakers encountered at train time. We randomly selected 12 video samples and show the synthesized audio for our baseline method (denoted by B in the paper) and for the method of Vougioukas et al. (Interspeech, 2019), denoted by V2S GAN. The videos are cropped around the lips, corresponding to the input of our network. These results correspond to Section 4.1 in our paper.
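As a rough illustration of what "cropped around the lips" means, here is a minimal sketch of extracting a mouth crop from a video frame using dlib's 68-point face landmarks (points 48-67 cover the mouth). This is only an example, not the preprocessing used in our pipeline; the landmark model path, crop margin, and output size are assumptions made for the sketch.

```python
# Illustrative only: crop a square region around the mouth using dlib landmarks.
# Model path, margin, and crop size are assumptions, not the paper's settings.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_lips(frame, size=64, margin=12):
    """Return a square crop around the mouth, or None if no face is found."""
    rects = detector(frame, 1)
    if not rects:
        return None
    shape = predictor(frame, rects[0])
    # Landmarks 48-67 correspond to the mouth region in the 68-point model.
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    x0, y0 = pts.min(axis=0) - margin
    x1, y1 = pts.max(axis=0) + margin
    crop = frame[max(y0, 0):y1, max(x0, 0):x1]
    return cv2.resize(crop, (size, size))
```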

s1 pwaj9a
place white at j nine again
s1 lwwf7s
lay white with f seven soon
s2 bgan6a
bin green at n six again
s2 lbid2s
lay blue in d two soon
s4 lgwtza
lay green with t zero again
s4 brwa2a
bin red with a two again
s4 swih6a
set white in h six again
s4 bbal3n
bin blue at l three now
s4 sbat4s
set blue at t four soon
s29 sbid7s
set blue in d seven soon
s29 pwwh9a
place white with h nine again
s29 brax3s
bin red at x three soon

Speaker control

In this experiment, we synthesize audio from two inputs: (i) the video stream showing the lip movement and (ii) a target speaker identity. For each test video sample, we synthesize audio in all the target voices encountered at train time. Depending on whether the identity shown in the video was seen at train time, we distinguish two scenarios: seen and unseen. You can select the desired target identity using the drop-down menu beneath each video.

We also provide results for two of our methods, which incorporate the speaker information in different forms: either the speaker identity (SI) or the speaker embedding (SE). You can select the method using the corresponding drop-down menu. Note that the SE method cannot control the speaker as well as SI does, but it still translates reasonably well across genders (when going from male to female speakers, or vice versa).
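To make the difference between the two conditioning schemes concrete, here is a minimal sketch, not the actual architecture from the paper: SI looks up a learned embedding from the target speaker's index, while SE uses a precomputed speaker-embedding vector (for example, from a speaker-verification model); in both cases the speaker representation is fused with the video features before decoding. All module names, feature sizes, and the fusion scheme below are illustrative assumptions.

```python
# Minimal sketch of the two conditioning variants (SI vs. SE). This is NOT the
# architecture from the paper: module names, sizes, and the fusion of the
# speaker representation with the video features are assumptions only.
import torch
import torch.nn as nn

class SpeakerConditionedV2S(nn.Module):
    def __init__(self, num_speakers=14, video_dim=512, spk_dim=256, mode="SI"):
        super().__init__()
        self.mode = mode
        # Stand-in for a video encoder over the lip crops (in practice, a
        # spatio-temporal network would produce the per-frame features).
        self.video_encoder = nn.GRU(video_dim, 512, batch_first=True)
        if mode == "SI":
            # SI: learned embedding table indexed by the target speaker's id.
            self.speaker_table = nn.Embedding(num_speakers, spk_dim)
        # SE: a precomputed speaker-embedding vector is passed in at forward
        # time (e.g. extracted by a pretrained speaker-verification model).
        self.decoder = nn.Linear(512 + spk_dim, 80)  # per-frame mel-spectrogram

    def forward(self, video_feats, speaker_id=None, speaker_emb=None):
        h, _ = self.video_encoder(video_feats)           # (B, T, 512)
        if self.mode == "SI":
            s = self.speaker_table(speaker_id)           # (B, spk_dim)
        else:
            s = speaker_emb                              # (B, spk_dim)
        s = s.unsqueeze(1).expand(-1, h.size(1), -1)     # broadcast over time
        return self.decoder(torch.cat([h, s], dim=-1))   # (B, T, 80)
```

In this sketch, speaker control at test time amounts to keeping the input video fixed and sweeping the target identity, which is what the drop-down menus in the demos below expose: with SI, iterate over the 14 speaker indices seen at train time; with SE, swap in different speaker-embedding vectors.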

Seen scenario

In this scenario, the input videos at test time come from identities also encountered at train time (14 identities), but neither the video samples nor their word sequences were seen during training. Due to space limitations, we could not include quantitative results for this setting in our paper.

s1 bgwh8n
bin green with h eight now
s3 bgahzp
bin green at h zero please
s5 bbar7s
bin blue at r seven soon
s6 bril5p
bin red in l five please
s7 lgbz5s
lay green by z five soon
s10 bbbv4n
bin blue by v four now
s12 bbay6a
bin blue at y six again
s14 lgae1n
lay green at e one now
s15 bbaq6n
bin blue at q six now
s17 brwszp
bin red with s zero please
s22 brad7n
bin red at d seven now
s26 bgbz4a
bin green by z four again
s28 bgar2s
bin green at r two soon
s32 bbwj4s
bin blue with j four soon

Unseen scenario

In this scenario, the identities of the people in the input videos at test time (9 identities) differ from those encountered at train time (14 identities). We still synthesize speech in all the target voices encountered at train time (14 voices). These results correspond to Section 4.2 in our paper.

s2 bbbf5n
bin blue by f five now
s4 bbal5p
bin blue at l five please
s11 bbar1s
bin blue at r one soon
s13 bbak6p
bin blue at k six please
s18 bbak1p
bin blue at k one please
s19 bbaj8n
bin blue at j eight now
s25 bbac8n
bin blue at c eight now
s31 bbav6p
bin blue at v six please
s33 bbbi9s
bin blue by i nine soon