Audiovisual speech synthesis based on ultrasound tongue imaging

Tamás Gábor Csapó

doi:10.15775/Besztud.2022.273-291

Budapest University of Technology and Economics

Keywords: AV-TTS, deep neural networks, DNN, speech technology

Abstract

In this study, we present our initial results in audiovisual speech synthesis (AV-TTS), a subfield of the broader areas of speech synthesis and computer facial animation. The goal of visual speech synthesis is typically to generate facial motion or articulation-related information (e.g., lip or tongue movement, or velum position). We conduct experiments in text-to-articulation prediction, using ultrasound tongue images as targets. We extend a traditional DNN-TTS framework to also predict ultrasound tongue images, from which the continuous tongue motion can be reconstructed in synchrony with the synthesized speech. The final output is speech and an ultrasound tongue video in 'wedge' orientation. We use data from eight English speakers (roughly 200 sentences each) from the UltraSuite-TaL dataset, train several types of deep neural networks (DNNs), and show that, given the limited training material, simple DNNs are more suitable for predicting sequential articulatory data. Objective experiments and visualized predictions show that the proposed solution is feasible and that the generated ultrasound videos are mostly close to natural tongue movement, although they are sometimes oversmoothed. A specific application of audiovisual speech synthesis and text-to-articulation prediction is computer-assisted pronunciation training / computer-aided language learning, which can benefit second-language learners. With such an AV-TTS system, given arbitrary input text, one can hear the synthesized speech and, in synchrony with it, see how to move the tongue in 2D or 3D to produce the target speech sounds. This visual feedback can be helpful for pronunciation training in L2 learning, especially when the target language contains speech sounds that are difficult to articulate (e.g., significantly different from those of the learner's mother tongue).
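As a rough illustration of the text-to-articulation idea, the sketch below maps per-frame linguistic features to a small pixel grid with a feed-forward network and compares frames with mean squared error. It is a minimal sketch under stated assumptions, not the system used in the study. The layer sizes, the image size, the tanh hidden layer, the linear output, the use of MSE as the objective measure, and all function names are hypothetical. The weights are random and untrained, and only the standard library is used to keep the example self-contained.

```python
import math
import random

# All sizes below are hypothetical placeholders, not values from the study.
N_LINGUISTIC = 8      # per-frame linguistic input features
N_HIDDEN = 16         # hidden units
IMG_H, IMG_W = 4, 6   # tiny stand-in for a resized ultrasound frame


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) with small random weights and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, activation=None):
    """Apply one fully connected layer to a single input vector."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [activation(o) for o in out] if activation else out


def predict_frame(linguistic, layers):
    """Map one frame of linguistic features to an IMG_H x IMG_W pixel grid."""
    hidden = dense(linguistic, layers[0], math.tanh)
    flat = dense(hidden, layers[1])  # linear output (assumed pixel regression)
    return [flat[r * IMG_W:(r + 1) * IMG_W] for r in range(IMG_H)]


def mse(pred, target):
    """Mean squared error between two frames of equal shape."""
    diffs = [(p - t) ** 2
             for p_row, t_row in zip(pred, target)
             for p, t in zip(p_row, t_row)]
    return sum(diffs) / len(diffs)


rng = random.Random(0)
layers = [init_layer(N_LINGUISTIC, N_HIDDEN, rng),
          init_layer(N_HIDDEN, IMG_H * IMG_W, rng)]

# One predicted image per input frame; played in sequence, the frames form
# the tongue video that would accompany the synthesized speech.
utterance = [[rng.random() for _ in range(N_LINGUISTIC)] for _ in range(5)]
video = [predict_frame(frame, layers) for frame in utterance]
```

In a real setup the network would be trained on paired linguistic features and recorded ultrasound frames, typically with a deep learning framework, and the predicted frames would be time-aligned with the acoustic output of the synthesizer.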