Revised annotation conventions in Hungarian speech corpora

Katalin Mády; Tekla Etelka Gráczi; Anna Kohári; Péter Mihajlik

doi:10.15775/Besztud.2024.1.185-202

Abstract

This technical report presents the revised annotation conventions for one large and two smaller Hungarian speech corpora: the BEA Spoken Language Database, the Akaka Maptask Corpus and the Budapest Games Corpus. Because the annotations rely on standard Hungarian orthography rather than on the actual, partly reduced phonetic realisations, both linguistic and phonetic queries can be run on a large amount of data. Since the vast majority of the recordings contain (semi-)spontaneous speech, non-lexical phenomena such as hesitations (filled pauses) and non-verbal events such as laughter are also labelled. The frequency of these phenomena is demonstrated on Release 1, a subset of the BEA database, using speech samples from 115 speakers. Unsurprisingly, laughter and conversational grunts were more frequent in spontaneous speech when expressed in relative numbers. Hesitations occurred more often in semi-spontaneous speech than in read or spontaneous speech, showing that the task demanded a higher cognitive effort from speakers. Most questions were found in spontaneous speech, since the reading tasks did not include interrogatives.