Automating lip reading using deep neural networks and a mobile application user interface

Frigyes Viktor Arthur; Tamás Gábor Csapó

doi:10.15775/Besztud.2021.7-23

Frigyes Viktor Arthur
Tamás Gábor Csapó, Budapesti Műszaki és Gazdaságtudományi Egyetem

Abstract

Automatic lipreading is a technique for predicting spoken content from lip video input. The advantage of lip video over other articulatory imaging techniques (e.g. ultrasound tongue imaging, MRI) is that it is easily available and affordable: most modern smartphones have a front camera. A few solutions for lip-to-speech synthesis already exist, but they mostly concentrate on offline training and inference. In this research, we propose a system built from three components: a backend for deep neural network training and inference, a web service responsible for communication between the server and the client, and a frontend in the form of a mobile application. We trained two approaches, both using convolutional and recurrent neural networks. In the first case, we record the movements of the whole face and deduce the phonetic information from them. In the second case, only the mouth area is available as input data for the neural network. Our initial evaluation shows that the scenario is feasible: a top-5 classification accuracy of 74% is combined with feedback from the mobile application user, so that people with speech impairments might be able to communicate with this solution. The results of the articulatory-to-acoustic conversion can contribute to the development of 'Silent Speech Interface' (SSI) systems. The essence of SSI is to record the articulatory organs while the user does not actually make a sound, yet the system is able to synthesize speech based on the movement of those organs.