Lip video-based speech synthesis with convolutional and recurrent deep neural networks

Bianka Rácz; Tamás Gábor Csapó

doi:10.15775/Besztud.2020.57-72

Bianka Rácz: Budapest University of Technology and Economics, Department of Telecommunications and Media Informatics
Tamás Gábor Csapó: Budapest University of Technology and Economics, Department of Telecommunications and Media Informatics & MTA-ELTE "Lendület" Lingual Articulation Research Group

Abstract

Articulatory-to-acoustic mapping methods aim to convert articulatory movement into an acoustic speech signal. Complex techniques (e.g. ultrasound, MRI) are suitable for acquiring articulatory data, but lip movement also carries relevant information about speech sounds. Several studies have applied deep neural networks to the lip-to-speech problem and to automatic lipreading. Inspired by these earlier studies, in this paper we designed and implemented models that generate the spectral parameters of speech from lip videos. We then synthesized speech from the predicted spectral parameters using a vocoder. For the experiments, we used 1000 sentences from a male English speaker of the GRID audiovisual database, which contains video of the speakers' faces and synchronous speech. Based on the literature, we extended the baseline deep neural network model and identified two models that use convolutional and recurrent layers. The convolutional network takes single images as input, whereas the recurrent network can take into account the sequential nature of the input data: its input is eight consecutive face images. We compared these two new models to the original baseline model in a multi-step experiment. In an objective test, we generated speech with the vocoder and with the DNN models. We calculated the Mel Cepstral Distortion between the synthesized and reference sentences and found that the recurrent model has significantly lower error than the baseline fully connected DNN (FC-DNN), while the output of the convolutional model was not better than the baseline. After this, we collected the opinions of several subjects in an online subjective test, in which they rated how natural the speech utterances sounded. As in the objective experiment, the recurrent neural network (which takes eight consecutive images as input) was preferred in the subjective test. The results might be useful in Silent Speech Interfaces or in lipreading systems.
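
The abstract does not spell out how the Mel Cepstral Distortion was computed. As a minimal sketch, the following assumes a commonly used definition: the per-frame distortion in dB is (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), the 0th (energy) coefficient is excluded, and the result is averaged over time-aligned frames. The function name `mel_cepstral_distortion` and the list-of-lists input format are hypothetical and are not taken from the paper.

```python
import math


def mel_cepstral_distortion(reference, synthesized):
    """Mean Mel Cepstral Distortion (dB) over time-aligned frames.

    Assumptions (not from the paper): both inputs are equal-length
    sequences of mel-cepstral frames, each frame is a sequence of
    floats, and coefficient 0 (energy) is excluded from the distance.
    """
    if not reference or len(reference) != len(synthesized):
        raise ValueError("inputs must be non-empty and have the same number of frames")

    # Constant factor of the assumed MCD definition: (10 / ln 10) * sqrt(2)
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for ref_frame, syn_frame in zip(reference, synthesized):
        if len(ref_frame) != len(syn_frame):
            raise ValueError("frames must have the same number of coefficients")
        # Squared Euclidean distance over coefficients 1..D (skip index 0)
        squared = sum((r - s) ** 2 for r, s in zip(ref_frame[1:], syn_frame[1:]))
        total += scale * math.sqrt(squared)

    return total / len(reference)


# Toy example with made-up values: two frames of four coefficients each
ref = [[1.0, 0.5, 0.2, 0.1], [1.1, 0.4, 0.3, 0.0]]
syn = [[0.9, 0.5, 0.2, 0.1], [1.0, 0.4, 0.3, 1.0]]
mcd = mel_cepstral_distortion(ref, syn)
```

A lower value means the predicted spectral parameters are closer to the reference. Under this sketch the two sentences must already be aligned frame by frame, so the frame counts have to match.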